Large Language Models - From RNN to BERT

ATPowr · 83 slides · Nov 18, 2023

About This Presentation

Large Language Models


Slide Content

Language Models
From RNN to BERT

LI Yanran

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
(Diagram: Word Sequence → Language Model → Probability)

Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
  - Predicting the next word
  - e.g. "How are ________? (you)": the previous words are the context, and the blank is the word being predicted.

Estimating Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
  - Using the Chain Rule:
    P(w1, w2, …, wn)
      = P(wn | w1, …, wn-1) P(w1, …, wn-1)
      = P(wn | w1, …, wn-1) P(wn-1 | w1, …, wn-2) P(w1, …, wn-2)
      = …
  - Take the example above:
    P(how are you ?) = P(how) P(are | how) P(you | how are) P(? | how are you)
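As a minimal sketch of this chain-rule decomposition (the function and the `cond_log_prob` callback are illustrative placeholders, not code from the slides):

```python
def sentence_log_prob(tokens, cond_log_prob):
    """Chain rule: log P(w1, ..., wn) = sum_i log P(wi | w1, ..., wi-1)."""
    total = 0.0
    for i, word in enumerate(tokens):
        # cond_log_prob(word, context) stands in for any model that
        # returns log P(word | context); it is a hypothetical callback.
        total += cond_log_prob(word, tokens[:i])
    return total

# e.g. P(how are you ?) = P(how) P(are|how) P(you|how are) P(?|how are you)
```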

Estimating Language Model
• What is a language model
  - Estimating the joint probability over a word sequence
  - e.g. "How are you?"
  - Calculate P(w1, w2, w3, w4)
  - Using the Chain Rule:
    P(w1, w2, …, wn)
      = P(wn | w1, …, wn-1) P(w1, …, wn-1)
      = P(wn | w1, …, wn-1) P(wn-1 | w1, …, wn-2) P(w1, …, wn-2)
      = …
  - Estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities.

N-gram Language Model
• Markov Assumption
  - Limited History: for k ≥ 0,
    P(wi | w1, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
  - Estimating the k-th order Markov Model
  - Instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words.

A Bi-gram Example
(Figure) Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.
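A bigram model can be estimated directly from counts via maximum likelihood, P(w2 | w1) = count(w1, w2) / count(w1); a small sketch (the function name and toy corpus are made up for illustration):

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Maximum-likelihood bigram estimates from a list of token lists."""
    unigram_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]          # sentence boundary markers
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram_counts[w1] += 1
            bigram_counts[w1][w2] += 1
    return {w1: {w2: c / unigram_counts[w1] for w2, c in nexts.items()}
            for w1, nexts in bigram_counts.items()}

probs = train_bigram([["how", "are", "you", "?"]])
print(probs["how"]["are"])   # 1.0 on this one-sentence toy corpus
```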

N-gram Language Model
• Markov Assumption
  - Limited History: for k ≥ 0,
    P(wi | w1, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
  - Estimating the k-th order Markov Model
  - This is problematic!
    - Zero counts
    - Tradeoff between computation and performance

Evaluating Language Model
• Traditional Intrinsic Metric: Perplexity
  - The higher the conditional probability of the word sequence, the lower the perplexity.
  - Training an LM amounts to minimizing perplexity.
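The perplexity formula itself appears only as an image in the slides; for reference, the standard definition (as in Jurafsky & Martin) is:

```latex
\mathrm{PP}(W) \;=\; P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
            \;=\; \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \dots, w_{i-1})}}
```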

Evaluating Language Model
• Traditional Intrinsic Metric: Perplexity
• Extrinsic vs. Intrinsic
  - Extrinsic: embedding a trained LM in an application and measuring how much the application improves, i.e. evaluating the outputs of the downstream application models.
  - Intrinsic: evaluating the internal representations learned by a trained LM.
• A good LM should provide useful features!

Evaluating Language Model
• Useful features are those for language difficulties
  - Contextual Information for Ambiguity
    - e.g. "May may quit in May."
  - Dependency Information for Understanding
    - e.g. "Before was was was, was was is."
  - Background Knowledge for Understanding

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

Neural LM
• Unlimited History: Recurrent Neural Networks
  - RNNs are neural nets that can deal with sequences of variable length. They are able to do this by defining a recurrence relation over timesteps.
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

Neural LM
• Unlimited History: Recurrent Neural Networks
  - The activation value of the hidden layer depends on the current input as well as the activation value of the hidden layer from the previous time step.
  - In this way, recurrent neural language models (RNNLMs) avoid the limited-context constraint inherent in traditional N-gram models, since the hidden state embodies information about all of the preceding words, all the way back to the beginning of the sequence.
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.
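A minimal NumPy sketch of that recurrence, in the standard Elman form h_t = g(U h_{t-1} + W x_t) (the weight names and helper functions are generic, not taken from the slides):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One simple RNN step: h_t = tanh(W x_t + U h_{t-1} + b)."""
    return np.tanh(W @ x_t + U @ h_prev + b)

def run_rnn(xs, h0, W, U, b):
    """Unroll the recurrence over a variable-length input sequence."""
    h = h0
    for x_t in xs:                      # one step per token
        h = rnn_step(x_t, h, W, U, b)
    return h                            # final hidden state summarizes the whole prefix
```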

Neural LM
• Unlimited History: Recurrent Neural Networks
(Figures) Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

RNN Variants
• Gated RNNs
  - Credit assignment issue
  - Vanishing gradient issue
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

RNN Variants
• Gated RNNs
• Bi-directional RNNs
(Figures) Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.

Magic and Reality
• Magic: Flexible
  - Complex networks can be treated as modules that can be combined in creative ways.
  - Naturally suited for variable-length sequence processing
  - Theoretically handles long-term dependencies
• Reality: hard to train
  - Long-range dependencies are still tricky, despite gating
  - Only limited contextual information is captured
  - Loss of hierarchy
  - Sequentiality prohibits parallelization within instances
  - Training is slow

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

ELMo
• Embeddings from Language Models: ELMo
  - Learn word embeddings by building a bidirectional LM
• Proposes a new type of deep contextualised word representations (ELMo) that model:
  - Complex characteristics of word use (e.g., syntax and semantics)
  - How these uses vary across linguistic contexts (i.e., to model polysemy)

ELMo
• Embeddings from Language Models: ELMo
  - Learn word embeddings by building a bidirectional LM
• ELMo representations are:
  - Contextual: the representation of each word depends on the entire context in which it is used.
  - Deep: the word representations combine all layers of a deep pre-trained neural network.
  - Task-specific (two-stage):
    - The word representation is a linear combination of the corresponding hidden layers.
    - A downstream task learns the weighting parameters.
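A small sketch of that task-specific combination, i.e. a softmax-weighted sum of the biLM layer representations scaled by a task-learned factor (argument names are illustrative; the real ELMo implementation differs in detail):

```python
import numpy as np

def elmo_embedding(layer_states, s, gamma):
    """Combine the biLM layers for one token.

    layer_states: array of shape (L, dim), one hidden state per biLM layer.
    s:            task-learned layer scores (length L), softmax-normalized below.
    gamma:        task-learned global scale.
    """
    w = np.exp(s - np.max(s))
    w = w / w.sum()                                  # softmax over layers
    return gamma * (w[:, None] * layer_states).sum(axis=0)
```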

ELMo
(Figures) Credit: The Illustrated BERT, ELMo, and co. Jay Alammar.

Two-stage Feature Ensembling
• Feature-based Pre-training
  - ELMo can be integrated into almost all neural NLP tasks by simple concatenation to the embedding layer
Credit: Deep contextualized word representations. NAACL 2018.
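Concretely, the feature-based recipe is just concatenation at the input layer; a tiny illustration with made-up dimensions:

```python
import numpy as np

word_emb = np.random.randn(300)     # e.g. a static GloVe embedding of one token
elmo_emb = np.random.randn(1024)    # frozen ELMo vector for the same token in context
task_input = np.concatenate([word_emb, elmo_emb])   # fed to the downstream task model
print(task_input.shape)             # (1324,)
```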

ELMo Evaluation
• Feature-based Pre-training
  - ELMo can be integrated into almost all neural NLP tasks by simple concatenation to the embedding layer
  - The higher layers seemed to learn semantics, while the lower layers probably captured syntactic features
Credit: Deep contextualized word representations. NAACL 2018.

ELMo Features
• ELMo representations are:
  - Contextual
  - Deep
  - Task-specific (two-stage)
• Compared with standard RNNLMs, ELMo
  - Empirically captures contextual information more fully
  - Captures more abstract linguistic characteristics in its higher layers
  - But is still hard to parallelize

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

Transformers
• Attention is All You Need
  - The Transformer is a purely attention-based model consisting of:
    - an encoding component that is a stack of encoders, and
    - a decoding component that is a stack of the same number of decoders.
• Transformer Block
  - The encoders are identical Transformer Blocks.
  - The encoder architecture is also used in BERT.

Transformer Block
(Figures) Credit: The Illustrated Transformer. Jay Alammar.

Transformer Block
• Attention is All You Need
  - Attention: indexing/referring to contents using similarity
    - e.g. Encoder-Decoder Attention
    - Self-Attention
  - Convolution vs. Self-Attention
  - Self-Attention (similar to convolution):
    - Constant path length between any two positions
    - Variable-sized receptive field
    - Gating/multiplication enables crisp error propagation
    - Trivial to parallelize (per layer)
    - Long-distance context has "equal opportunity"
  - Self-attention is the method the Transformer uses to bake the "understanding" of other relevant words into the one we're currently processing.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Dot-Product Attention
  - Three Abstraction Vectors
Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Dot-Product Attention
  - Three Abstraction Vectors
    - Multiplying x1 by the W^Q weight matrix produces q1, the "query" vector associated with that word.
    - We end up creating a "query", a "key", and a "value" projection of each word in the input sentence.
Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Dot-Product Attention
  - Three Abstraction Vectors
  - Score
Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Scaled Dot-Product Attention
  - Three Abstraction Vectors
  - Score
  - Weighted Sum

Self-Attention Layer
• Dot-Product Attention
  - Three Abstraction Vectors
  - Score
  - Weighted Sum
Credit: The Illustrated Transformer. Jay Alammar.
Credit: Dissecting BERT Part 1: The Encoder.

Self-Attention Layer
(Figure) Credit: Dissecting BERT Part 1: The Encoder.
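Putting the query/key/value projections, scoring, and weighted sum together, a minimal NumPy sketch of scaled dot-product self-attention (matrix names and sizes are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    """Self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                    # query/key/value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])             # score every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                  # weighted sum of value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                             # 4 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)   # (4, 8)
```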

Self-Attention Layer
• Self-Attention
  - Bakes the "understanding" of other relevant words into the current one
• Scaled Dot-Product Attention
  - Three Abstraction Vectors
  - Score
  - The weighted sum cannot distinguish what information came from where
    - Convolution can: it applies a different linear transformation for each relative position.

Self-Attention Layer
• Multi-Head Self-Attention
  - Multiple attention layers (heads) in parallel (shown by different colors)
  - Each head uses different linear transformations
  - Different heads can learn different relationships
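A sketch of the multi-head variant, where each head has its own projections and the concatenated head outputs are mixed by a final linear map (the parameter layout is illustrative, not the paper's exact formulation):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, as in the previous sketch."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def multi_head_self_attention(X, heads, Wo):
    """heads: list of (Wq, Wk, Wv) tuples, one per head; Wo mixes the concatenation."""
    head_outputs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(head_outputs, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                              # 4 tokens, d_model = 8
heads = [tuple(rng.normal(size=(8, 2)) for _ in range(3)) for _ in range(4)]  # 4 heads, d_k = 2
Wo = rng.normal(size=(8, 8))                             # 4 heads * 2 dims -> back to 8
print(multi_head_self_attention(X, heads, Wo).shape)     # (4, 8)
```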

Self-Attention Layer
• Multi-Head Self-Attention
(Figures) Credit: The Illustrated Transformer. Jay Alammar.

Self-Attention Layer
(Attention visualization) Visualized using the Tensor2Tensor notebook.

Transformer Evaluation
• Results on WMT-14

Outline
• Language Model
  - What, How, Why
• Neural Language Model: RNN
  - Magic and Reality
• Contextual Embeddings: ELMo
  - Two-stage Ensembling
• Parallelizable: Transformers
  - Attention is All You Need
• Milestone: BERT
  - ELMo + Transformers!

BERT
• Bidirectional Encoder Representations from Transformers: BERT
  - BERT ≈ Encoder of the Transformer
• Combining the Two Worlds:
  - BERT ≈ ELMo + Transformer
  - ELMo's language model was bi-directional, but based on (slow) RNNs
  - The Transformer purely relies on (parallelized) attention, but only trains a forward language model
  - Why: language models use only left context or right context, but language understanding is bidirectional

BERT
• Handling the bi-directionality:
  - The Transformer's encoder is only a forward language model
  - If it were to perform bidirectional self-attention, the model would learn that the next word in the sentence is the target and would always predict it, with 100% accuracy.
  - BERT solves this problem using two "new" paradigms:
    - Masked Language Model
    - Next Sentence Prediction (Two-Sentence Tasks)

Masked Language Model
• Predict random words from within the sequence
(Figure) Credit: The Illustrated BERT. Jay Alammar.

Masked Language Model
• Predict random words from within the sequence
  - In this specific case, 15% of the words fed in as input were masked.
  - Too little masking: too expensive to train
  - Too much masking: not enough context
Credit: The Illustrated BERT. Jay Alammar.

Masked Language Model
• Predict random words from within the sequence
  - In this specific case, 15% of the words fed in as input were masked.
  - But not all tokens were masked in the same way:
    - 80% of the time, replace with [MASK]: went to the store → went to the [MASK]
    - 10% of the time, replace with a random word: went to the store → went to the running
    - 10% of the time, keep the same: went to the store → went to the store
Credit: The Illustrated BERT. Jay Alammar.
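A small sketch of that 80/10/10 masking rule (the function name and vocabulary are illustrative, not BERT's actual preprocessing code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Pick ~15% of positions as prediction targets, then apply the 80/10/10 rule."""
    masked, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = token                    # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = "[MASK]"              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = random.choice(vocab)  # 10%: replace with a random word
            # else: 10%: keep the original token unchanged
    return masked, targets

print(mask_tokens(["went", "to", "the", "store"], vocab=["running", "store", "dog"]))
```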

Masked Language Model
• Predict random words from within the sequence
  - In this specific case, 15% of the words fed in as input were masked.
  - But not all tokens were masked in the same way, because:
    - If the model had been trained only on predicting [MASK] tokens and then never saw this token during fine-tuning, it would have thought that there was no need to predict anything.
    - The model would have only learned a contextual representation of the [MASK] token, and this would have made it learn slowly (since only 15% of the input tokens are masked).
Credit: The Illustrated BERT. Jay Alammar.

Next Sentence Prediction
• To learn relationships between sentences
  - Predict whether Sentence B is the actual sentence that follows Sentence A, or a random one
  - BERT ≈ Encoder of the Transformer, with input specifics:
    - [CLS]
    - [SEP]
Credit: The Illustrated BERT. Jay Alammar.
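For illustration, one way the two-sentence input is typically packed (the token strings are made up; real BERT uses WordPiece subwords):

```python
# Sentence A: "how are you ?"   Sentence B: "i am fine ."
tokens      = ["[CLS]", "how", "are", "you", "?", "[SEP]", "i", "am", "fine", ".", "[SEP]"]
segment_ids = [0,       0,     0,     0,     0,   0,       1,   1,    1,      1,   1]
is_next     = 1   # label: 1 if Sentence B actually follows Sentence A, 0 if sampled at random
```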

Next Sentence Prediction
(Figure) Credit: The Illustrated BERT. Jay Alammar.

Two-Stage Fine-Tuning
• Fine-tuning for downstream tasks
  - The BERT paper shows a number of ways to use BERT for different tasks.
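For reference only (the slides do not show code), a common way to load a pre-trained BERT with a fresh classification head is the Hugging Face `transformers` library; a minimal sketch:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained encoder + a newly initialized task head (e.g. 2 labels).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The tokenizer inserts [CLS]/[SEP] and the segment ids for a sentence pair.
inputs = tokenizer("How are you?", "I am fine.", return_tensors="pt")
logits = model(**inputs).logits    # output of the task head; fine-tune with the usual loss
```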

BERT Evaluation
• GLUE Results
  - 8 GLUE classification tasks were used to assess performance.
  - BERT-base not only beat OpenAI GPT on all tasks, achieving SOTA, but improved SOTA by an impressive 5% on average. BERT-large beat BERT-base on all tasks as well.

BERT Features
• Intrinsic Representations
  - We can also use the BERT contextualized word embeddings.
(Figures) Credit: The Illustrated BERT. Jay Alammar.

BERT Features
• Intrinsic Representations
  - An important detail of BERT is the preprocessing used for the input text: [CLS], [SEP]
Credit: What Does BERT Look At? An Analysis of BERT's Attention. BlackBoxNLP 2019.

Outline
• Language Model
• Neural Language Model: RNN
• Contextual Embeddings: ELMo
• Parallelizable: Transformers
• Milestone: BERT!
• What's Next?
• How to Use BERT?
• How to Improve BERT?
• What Are the Hands-on Experiences?

Syllabus
• Language Model
• Machine Translation
• Sentiment Analysis
• Machine Comprehension
• Knowledge Graph
• Text Generation

Thank you!
Questions?
Course Registration WeChat Group