Outline
n Language Model
- What, How, Why
n Neural Language Model: RNN
- Magic and Reality
n Contextual Embeddings: ELMo
- Two-stage Ensembling
n Parallelizable: Transformers
- Attention is All You Need
n Milestone: BERT
- ELMo + Transformers!
1!
Outline
n Language Model
- What, How, Why
n Neural Language Model: RNN
- Magic and Reality
n Contextual Embeddings: ELMo
- Two-stage Ensembling
n Parallelizable: Transformers
- Attention is All You Need
n Milestone: BERT
- ELMo + Transformers!
2!
Language Model
n What is a language model
- Estimating the joint probability over word sequence
- e.g. “How are you?”
- Calculate P(w1, w2, w3, w4)
3!
[Figure: Word Sequence → Language Model → Probability]
Language Model
n What is a language model
- Estimating the joint probability over word sequence
- e.g. “How are you?”
- Calculate P(w1, w2, w3, w4)
- Predicting the next word
- e.g. “How are ____ ?” → previous words (context): “How are”; word being predicted: “you”
4!
Estimating Language Model
n What is a language model
- Estimating the joint probability over word sequence
- e.g. “How are you?”
- Calculate P(w1, w2, w3, w4)
- Using Chain Rule
P(w1, w2, …, wn)
= P(wn | w1, …, wn-1) P(w1, …, wn-1)
= P(wn | w1, …, wn-1) P(wn-1 | w1, …, wn-2) P(w1, …, wn-2)
= …
- Take the example above
P(how are you ?)
= P(how) P(are|how) P(you|how are) P(?|how are you)
5!
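To make the chain-rule factorization on the slide above concrete, here is a minimal Python sketch. The conditional probability values are made-up toy numbers for illustration, not estimates from any corpus.

```python
# Chain-rule sketch for P(how are you ?); the probabilities are toy values.
conditionals = [
    0.05,  # P(how)
    0.30,  # P(are | how)
    0.40,  # P(you | how are)
    0.60,  # P(? | how are you)
]

p = 1.0
for prob in conditionals:
    p *= prob
print(f"P(how are you ?) = {p:.4f}")  # 0.05 * 0.30 * 0.40 * 0.60 = 0.0036
```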
Estimating Language Model
n What is a language model
- Estimating the joint probability over word sequence
- e.g. “How are you?”
- Calculate P(w1, w2, w3, w4)
- Using Chain Rule
P(w1, w2, …, wn)
= P(wn | w1, …, wn-1) P(w1, …, wn-1)
= P(wn | w1, …, wn-1) P(wn-1 | w1, …, wn-2) P(w1, …, wn-2)
= …
Estimate the joint probability of an entire sequence of words
by multiplying together a number of conditional probabilities.
6!
N-gram Language Model
n Markov Assumption
- Limited History: For k >= 0,
P(wi | w1, …, wi-k, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
- Estimating the kth-order Markov Model
- Instead of computing the probability of a word given its entire
history, we can approximate the history by just the last few
words.
7!
A Bi-gram Example
8!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
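As a concrete counting sketch (a toy corpus, not the textbook's example), a bigram model can be estimated by maximum likelihood: P(wi | wi-1) = count(wi-1, wi) / count(wi-1).

```python
from collections import Counter

# Tiny toy corpus; <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "how", "are", "you", "?", "</s>"],
    ["<s>", "how", "are", "they", "?", "</s>"],
    ["<s>", "you", "are", "fine", "</s>"],
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))

def bigram_prob(prev, word):
    """MLE estimate: count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(bigram_prob("how", "are"))   # 2/2 = 1.0
print(bigram_prob("are", "you"))   # 1/3 ≈ 0.333
print(bigram_prob("are", "cats"))  # 0.0 -> the zero-count problem noted on the next slide
```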
N-gram Language Model
n Markov Assumption
- Limited History: For k >= 0,
P(wi | w1, …, wi-k, …, wi-1) ≈ P(wi | wi-k, …, wi-1)
- Estimating the kth-order Markov Model
- This is problematic!
- Zero counts
- Trade-off between computation and performance
9!
Evaluating Language Model
n Traditional Intrinsic Metric: Perplexity
- The higher the conditional probability of the word sequence,
the lower the perplexity.
- Training an LM amounts to minimizing perplexity.
10!
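As a sketch of the metric: for a test sequence w1…wN, perplexity is P(w1…wN)^(-1/N), usually computed in log space. The per-word probabilities below are made-up toy values.

```python
import math

# Per-word conditional probabilities assigned by some LM to a test sequence
# (toy numbers for illustration only).
word_probs = [0.05, 0.30, 0.40, 0.60]  # P(how), P(are|how), P(you|how are), P(?|how are you)

n = len(word_probs)
log_prob = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_prob / n)
print(f"perplexity = {perplexity:.2f}")  # higher sequence probability -> lower perplexity
```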
Evaluating Language Model
n Traditional Intrinsic Metric: Perplexity
n Extrinsic vs. Intrinsic
- Embedding a trained LM in an application and measuring how much the application improves.
- Evaluating the outputs of the down-stream application
models
- Evaluating the internal representations learned by a trained
LM
n A good LM should provide useful features!
11!
Evaluating Language Model
n Useful features are those for language difficulties
- Contextual Information for Ambiguity
12!
Evaluating Language Model
n Useful features are those for language difficulties
- Contextual Information for Ambiguity
13!
Evaluating Language Model
n Useful features are those for language difficulties
- Contextual Information for Ambiguity
- e.g.
“May may quit in May.”
14!
Evaluating Language Model
n Useful features are those for language difficulties
- Dependency Information for Understanding
15!
Evaluating Language Model
n Useful features are those for language difficulties
- Dependency Information for Understanding
16!
Evaluating Language Model
n Useful features are those for language difficulties
- Dependency Information for Understanding
- e.g.
“Before was was was, was was is.”
17!
Evaluating Language Model
n Useful features are those for language difficulties
- Background Knowledge for Understanding
18!
Outline
n Language Model
- What, How, Why
n Neural Language Model: RNN
- Magic and Reality
n Contextual Embeddings: ELMo
- Two-stage Ensembling
n Parallelizable: Transformers
- Attention is All You Need
n Milestone: BERT
- ELMo + Transformers!
19!
Neural LM
n Unlimited History: Recurrent Neural Networks
- RNNs are neural nets that can deal with sequences of variable length. They are able to do this by defining a recurrence relation over time steps:
20!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
Neural LM
n Unlimited History: Recurrent Neural Networks
- The activation value of the hidden layer depends on the
current input as well as the activation value of the hidden
layer from the previous time step.
- In this way, Recurrent neural language models (RNNLMs)
avoid the limited context constraint inherent in traditional N-
gram models, since the hidden state embodies information
about all of the preceding words all the way back to the
beginning of the sequence.
21!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
Neural LM
n Unlimited History: Recurrent Neural Networks
- The activation value of the hidden layer depends on the
current input as well as the activation value of the hidden
layer from the previous time step:
22!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
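A minimal numpy sketch of the recurrence described above, h_t = g(U h_{t-1} + W x_t), with tanh as the nonlinearity and the bias omitted; this is an illustrative toy, not the textbook's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 8, 16

# Parameters of a single simple (Elman) RNN cell.
W = rng.normal(scale=0.1, size=(d_hidden, d_in))      # input -> hidden
U = rng.normal(scale=0.1, size=(d_hidden, d_hidden))  # hidden -> hidden (recurrence)

def rnn_forward(xs):
    """Run the recurrence h_t = tanh(U h_{t-1} + W x_t) over a sequence."""
    h = np.zeros(d_hidden)
    states = []
    for x in xs:                      # strictly sequential: no parallelism over time steps
        h = np.tanh(U @ h + W @ x)
        states.append(h)
    return np.stack(states)

sequence = rng.normal(size=(5, d_in))  # 5 time steps of toy input vectors
hidden_states = rnn_forward(sequence)
print(hidden_states.shape)             # (5, 16)
```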
Neural LM
n Unlimited History: Recurrent Neural Networks
23!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
Neural LM
n Unlimited History: Recurrent Neural Networks
24!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
RNN Variants
n Gated RNNs
- Credit assignment issue
- Vanishing gradient issue
25!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
RNN Variants
n Gated RNNs
n Bi-directional RNNs
26!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
RNN Features
n Gated RNNs
n Bi-directional RNNs
27!
Credit: Speech and Language Processing. Daniel Jurafsky & James H. Martin.!
Magic and Reality
n Magic: Flexible
- Complex networks can be treated as modules that can be combined in creative ways.
- Naturally suited for variable-length sequence processing
- Theoretically handling long-term dependencies
n Reality: hard to train
- Long-range dependencies still tricky, despite gating
- Only limited contextual information is captured
- Loss of hierarchy
- Sequentiality prohibits parallelization within instances
- Training is slow
28!
Outline
n Language Model
- What, How, Why
n Neural Language Model: RNN
- Magic and Reality
n Contextual Embeddings: ELMo
- Two-stage Ensembling
n Parallelizable: Transformers
- Attention is All You Need
n Milestone: BERT
- ELMo + Transformers!
29!
ELMo
n Embeddings from Language Models: ELMo
- Learn word embeddings by building a bidirectional LM
n Propose a new type of deep contextualised word
representations (ELMo) that model:
- Complex characteristics of word use
- e.g., syntax and semantics
- How these uses vary across linguistic contexts
- i.e., to model polysemy
30!
ELMo
n Embeddings from Language Models: ELMo
- Learn word embeddings by building a bidirectional LM
n ELMo representations are:
- Contextual: Each word representation depends on the entire context in which the word is used.
- Deep: The word representations combine all layers of a deep
pre-trained neural network.
- Task-specific (two-stage):
- The word representation is a linear combination of
corresponding hidden layers.
- A down-stream task learns weighting parameters.
31!
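A sketch of the task-specific combination step described above, as in the ELMo paper: ELMo_k = γ · Σ_j s_j · h_{k,j}, where the s_j are softmax-normalized scalar weights learned by the downstream task and γ is a task-specific scale. The layer tensors below are random stand-ins for a real biLM's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, seq_len, dim = 3, 6, 512          # e.g. token layer + 2 biLSTM layers

# Stand-in for the biLM's per-layer hidden states for one sentence.
layer_states = rng.normal(size=(num_layers, seq_len, dim))

# Task-learned parameters (fixed toy values here).
raw_weights = np.array([0.2, 1.0, 0.5])       # s_j before normalization
gamma = 1.0                                    # task-specific scale

s = np.exp(raw_weights) / np.exp(raw_weights).sum()     # softmax-normalized layer weights
elmo_embeddings = gamma * np.einsum("j,jtd->td", s, layer_states)
print(elmo_embeddings.shape)                  # (6, 512): one contextual vector per token
```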
ELMo
32!
Credit: The Illustrated BERT, ELMo, and co. Jay Alammar.!
ELMo
33!
Credit: The Illustrated BERT, ELMo, and co. Jay Alammar.!
Two-stage Feature Ensembling
n Feature-based Pre-training
- ELMo can be integrated into almost all neural NLP tasks by simple concatenation with the embedding layer
34!
Credit: Deep contextualized word representations. NAACL 2018.!
ELMo Evaluation
n Feature-based Pre-training
- ELMo can be integrated into almost all neural NLP tasks by simple concatenation with the embedding layer
35!
Credit: Deep contextualized word representations. NAACL 2018.!
ELMo Evaluation
n Feature-based Pre-training
- The higher layer seemed to learn semantics while the lower
layer probably captured syntactic features
36!
Credit: Deep contextualized word representations. NAACL 2018.!
ELMo Features
n ELMo representations are:
- Contextual
- Deep
- Task-specific (two-stage)
n Compared with standard RNNLMs, ELMo
- Empirically learns contextual information more effectively
- Captures more abstract linguistic characteristics in its higher layers
- But still hard to parallelize
37!
Outline
n Language Model
- What, How, Why
n Neural Language Model: RNN
- Magic and Reality
n Contextual Embeddings: ELMo
- Two-stage Ensembling
n Parallelizable: Transformers
- Attention is All You Need
n Milestone: BERT
- ELMo + Transformers!
38!
Transformers
n Attention is All You Need
- The Transformer is a purely attention-based model consisting of:
- An encoding component: a stack of encoders.
- A decoding component: a stack of the same number of decoders.
n Transformer Block
- The encoders are identical Transformer blocks.
- The encoder architecture is also used in BERT.
39!
Transformers
n Attention is All You Need
- The Transformer is a purely attention-based model consisting of:
- An encoding component: a stack of encoders.
- A decoding component: a stack of the same number of decoders.
n Transformer Block
- The encoders are identical Transformer blocks.
- The encoder architecture is also used in BERT.
40!
Transformer Block
41!
Credit: The Illustrated Transformer. Jay Alammar.!
Transformer Block
42!
Credit: The Illustrated Transformer. Jay Alammar.!
Transformer Block
43!
Credit: The Illustrated Transformer. Jay Alammar.!
Transformer Block
n Attention is All You Need
- Attention: indexing/referring contents using similarity
- e.g. Encoder-Decoder Attention
44!
Transformer Block
n Attention is All You Need
- Attention: indexing/referring contents using similarity
- Self-Attention
45!
Transformer Block
n Attention is All You Need
- Attention: indexing/referring contents using similarity
- Convolution vs. Self-Attention
46!
Transformer Block
n Attention is All You Need
- Attention: indexing/referring contents using similarity
- Self-Attention (similar to convolution)
- Constant path length between any two positions
- Variable-sized receptive field
- Gating/multiplication enables crisp error propagation
- Trivial to parallelize (per layer)
- Long-distance context has “equal opportunity”
- Self-attention is the method the Transformer uses to bake the
“understanding” of other relevant words into the one we’re
currently processing.
47!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Dot-Product Attention
48!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Dot-Product Attention
- Three Abstraction Vectors
49!
Credit: The Illustrated Transformer. Jay Alammar.!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Dot-Product Attention
- Three Abstraction Vectors
- Multiplying x1 by the WQ weight matrix produces q1, the "query" vector associated with that word.
- We end up creating a "query", a "key", and a "value"
projection of each word in the input sentence.
50!
Credit: The Illustrated Transformer. Jay Alammar.!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Dot-Product Attention
- Three Abstraction Vectors
- Score
51!
Credit: The Illustrated Transformer. Jay Alammar.!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Dot-Product Attention
- Three Abstraction Vectors
- Score
52!
Credit: The Illustrated Transformer. Jay Alammar.!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Scaled Dot-Product Attention
- Three Abstraction Vectors
- Score
- Weighted Sum
53!
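A numpy sketch of the full scaled dot-product attention computation listed above: project the inputs into Q, K, V, score with QKᵀ/√d_k, softmax, then take the weighted sum of the values. Dimensions and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))        # one toy "sentence" of embeddings
W_Q, W_K, W_V = (rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # query / key / value projections
scores = Q @ K.T / np.sqrt(d_k)                # scaled dot-product scores
weights = softmax(scores, axis=-1)             # each row sums to 1
output = weights @ V                           # weighted sum of value vectors
print(output.shape)                            # (4, 8)
```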
Self-Attention Layer
54!
Credit: The Illustrated Transformer. Jay Alammar.!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Dot-Product Attention
- Three Abstraction Vectors
- Score
- Weighted Sum
55!
Credit: Dissecting BERT Part 1: The Encoder.!
Self-Attention Layer
56!
Credit: Dissecting BERT Part 1: The Encoder.!
Self-Attention Layer
n Self-Attention
- “understanding” of other relevant words into the current one
n Scaled Dot-Product Attention
- Three Abstraction Vectors
- Score
- Weighted Sum cannot distinguish what information came
from where
- Convolution can: a different linear transformation for each
relative position.
57!
Self-Attention Layer
n Multi-Head Self-Attention
- Multiple attention layers (heads) in parallel (shown by
different colors)
- Each head uses different linear transformations
- Different heads can learn different relationships
58!
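A sketch of multi-head self-attention under the same toy assumptions as the previous block: run several independent scaled dot-product attentions with separate projections, concatenate the heads, then apply an output projection W_O. All weights here are random toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 4, 64, 8
d_k = d_model // num_heads                      # per-head dimension

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

X = rng.normal(size=(seq_len, d_model))
# One (W_Q, W_K, W_V) triple per head, plus the output projection W_O.
head_weights = [tuple(rng.normal(scale=0.1, size=(d_model, d_k)) for _ in range(3))
                for _ in range(num_heads)]
W_O = rng.normal(scale=0.1, size=(num_heads * d_k, d_model))

heads = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in head_weights]
multi_head_out = np.concatenate(heads, axis=-1) @ W_O   # (seq_len, d_model)
print(multi_head_out.shape)                             # (4, 64)
```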
Self-Attention Layer
n Multi-Head Self-Attention
59!
Self-Attention Layer
n Multi-Head Self-Attention
Credit: The Illustrated Transformer. Jay Alammar.!
60!
Self-Attention Layer
n Multi-Head Self-Attention
Credit: The Illustrated Transformer. Jay Alammar.!
61!
Self-Attention Layer
62!
Visualized using Tensor2Tensor notebook !
Transformer Evaluation
63!
n Results on WMT-14
Outline
n Language Model
- What, How, Why
n Neural Language Model: RNN
- Magic and Reality
n Contextual Embeddings: ELMo
- Two-stage Ensembling
n Parallelizable: Transformers
- Attention is All You Need
n Milestone: BERT
- ELMo + Transformers!
64!
BERT
n Bidirectional Encoder Representations from Transformers:
BERT
- BERT ≈ Encoder of Transformer
n Combining the Two Worlds:
- BERT ≈ ELMo + Transformer
- ELMo’s language model was bi-directional, but based on
(slow) RNNs
- The Transformer relies purely on (parallelizable) attention, but trains only a forward language model.
- Why: Language models only use left context or right context,
but language understanding is bidirectional.
65!
BERT
n Handling the bi-directionality:
- Transformer’s encoder is only a forward language model
- If they were to perform bidirectional self-attention, then the
model would learn that the next word in the sentence is the
target and would predict it always, with 100% accuracy.
66!
BERT
n Handling the bi-directionality:
- Transformer’s encoder is only a forward language model
- If they were to perform bidirectional self-attention, then the
model would learn that the next word in the sentence is the
target and would predict it always, with 100% accuracy.
- BERT solves this problem using two “new” paradigms:
- Masked Language Model
- Next Sentence Prediction (Two-Sentence Tasks)
67!
Masked Language Model
n Predict random words from within the sequence
- Transformer’s encoder is only a forward language model
- If they were to perform bidirectional self-attention, then the
model would learn that the next word in the sentence is the
target and would predict it always, with 100% accuracy.
- BERT solves this problem using two “new” paradigms:
- Masked Language Model
- Two-sentence Tasks
Credit: The Illustrated BERT. Jay Alammar.!
68!
Masked Language Model
n Predict random words from within the sequence
- In this specific case, 15% of the words that were fed in as
input were masked.
- Too little masking: Too expensive to train
- Too much masking: Not enough context
69!
Credit: The Illustrated BERT. Jay Alammar.!
Masked Language Model
n Predict random words from within the sequence
- In this specific case, 15% of the words that were fed in as
input were masked.
- But not all tokens were masked in the same way:
- 80% of the time, replace with [MASK]
- went to the store → went to the [MASK]
- 10% of the time, replace with a random word
- went to the store → went to the running
- 10% of the time, keep same
- went to the store → went to the store
70!
Credit: The Illustrated BERT. Jay Alammar.!
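A sketch of the 80/10/10 masking rule described above, applied to a toy token list. The vocabulary and tokens are made up, and real BERT operates on WordPiece tokens; the demo call uses a higher masking rate just so the tiny example shows an effect.

```python
import random

random.seed(0)
vocab = ["went", "to", "the", "store", "running", "dog"]

def mask_tokens(tokens, mask_rate=0.15):
    """Pick ~15% of positions as prediction targets, then apply the 80/10/10 rule."""
    inputs, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:
            continue
        targets[i] = tok                       # the model must predict the original token
        r = random.random()
        if r < 0.8:
            inputs[i] = "[MASK]"               # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = random.choice(vocab)   # 10%: replace with a random word
        # else: 10% keep the token unchanged
    return inputs, targets

print(mask_tokens(["went", "to", "the", "store"], mask_rate=0.5))
```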
Masked Language Model
n Predict random words from within the sequence
- In this specific case, 15% of the words that were fed in as
input were masked.
- But not all tokens were masked in the same way since:
- If the model had been trained on only predicting [MASK] tokens and then never saw this token during fine-tuning, it would have thought that there was no need to predict anything.
- The model would have only learned a contextual representation of the [MASK] token, and this would have made it learn slowly (since only 15% of the input tokens are masked).
71!
Credit: The Illustrated BERT. Jay Alammar.!
Next Sentence Prediction
n To learn relationships between sentences
- Predict whether Sentence B is the actual sentence that follows Sentence A, or a random one
- BERT ≈ Encoder of Transformer with Input Specifics
- [CLS]
- [SEP]
72!
Credit: The Illustrated BERT. Jay Alammar.!
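A sketch of the two-sentence input format mentioned above: prepend [CLS], separate the two segments with [SEP], and record segment ids; whether sentence B actually follows sentence A becomes the binary label. Tokenization here is a naive whitespace split, not BERT's WordPiece.

```python
def build_nsp_example(sent_a, sent_b, is_next):
    """Pack two sentences into BERT's input format for next-sentence prediction."""
    tokens_a, tokens_b = sent_a.split(), sent_b.split()      # naive tokenization
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return {"tokens": tokens, "segment_ids": segment_ids, "is_next": int(is_next)}

example = build_nsp_example("the man went to the store", "he bought a gallon of milk", True)
print(example["tokens"])
print(example["segment_ids"])
```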
Next Sentence Prediction
Credit: The Illustrated BERT. Jay Alammar.!
73!
Two-Stage Fine-Tuning
n Fine-tuning for downstream tasks
- The BERT paper shows a number of ways to use BERT for
different tasks.
74!
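A minimal sketch of the fine-tuning idea for a sentence-classification task, assuming a hypothetical `pretrained_encoder` that returns one hidden vector per input token: add a small classification head on top of the [CLS] vector and train everything end-to-end on the downstream data. This illustrates the recipe only; it is not the BERT paper's code, and the encoder here just returns random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 2

def pretrained_encoder(tokens):
    """Hypothetical stand-in for a pre-trained BERT encoder: one hidden
    vector per input token (random values here)."""
    return rng.normal(size=(len(tokens), hidden_size))

# Task-specific classification head added for fine-tuning; in real fine-tuning
# both this head and the encoder weights are updated on the downstream task.
W_cls = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b_cls = np.zeros(num_labels)

tokens = ["[CLS]", "this", "movie", "was", "great", "[SEP]"]
hidden_states = pretrained_encoder(tokens)
cls_vector = hidden_states[0]                       # the [CLS] position summarizes the sequence
logits = cls_vector @ W_cls + b_cls
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)                                        # class probabilities for the toy example
```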
BERT Evaluation
75!
n GLUE Results
- 8 GLUE classification tasks were used to assess performance.
- BERT-base not only beat OpenAI GPT on all tasks, achieving SOTA, but improved SOTA by an impressive 5% on average. BERT-large beat BERT-base on all tasks as well.
BERT Features
76!
n Intrinsic Representations
- We can also use the BERT contextualized word embeddings.
Credit: The Illustrated BERT. Jay Alammar.!
BERT Features
77!
n Intrinsic Representations
- We can also use the BERT contextualized word embeddings.
Credit: The Illustrated BERT. Jay Alammar.!
BERT Features
78!
n Intrinsic Representations
- An important detail of BERT is the preprocessing of the input text: the special [CLS] and [SEP] tokens.
Credit: What Does BERT Look At? An Analysis of BERT’s Attention. BlackBoxNLP 2019.!
Outline
n Language Model
n Neural Language Model: RNN
n Contextual Embeddings: ELMo
n Parallelizable: Transformers
n Milestone: BERT!
n What’s Next?
n How to Use BERT?
n How to Improve BERT?
n What are Hands-on Experiences?
Syllabus
n Language Model
n Machine Translation
n Sentiment Analysis
n Machine Comprehension
n Knowledge Graph
n Text Generation