GPT-2: Language Models are Unsupervised Multitask Learners

YoungSeokKim8 · 31 slides · Sep 27, 2019

About This Presentation

Review of paper
Language Models are Unsupervised Multitask Learners
(GPT-2)
by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in...


Slide Content

Language Models are Unsupervised Multitask Learners
(GPT-2)
OpenAI
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.03.03
Presented by Young Seok Kim
PR-145

Articles & Useful Links
•Official
•Technical Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
•Blog: https://blog.openai.com/better-language-models/
•GitHub: https://github.com/openai/gpt-2
•Unofficial
•Reddit: https://www.reddit.com/r/MachineLearning/comments/aqlzde/r_openai_better_language_models_and_their/

Related Papers
•Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)
•PR-049: https://youtu.be/6zGgVIlStXs
•Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html
•Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)
•Website: https://blog.openai.com/language-unsupervised/
•Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
•Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding.” (2018)
•Website: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
•Paper: https://arxiv.org/abs/1810.04805
•PR-121: https://youtu.be/GK4IO3qOnLc

Dataset

Dataset (BERT)
BookCorpus (800M words) + Wikipedia (2,500M words)

Common Crawl?
•Common Crawl has significant data quality issues.
•Prior work achieved the best results by using a small subsample of Common Crawl which included only documents most similar to the target dataset.
•The authors of GPT-2 wanted to avoid making assumptions about the tasks to be performed ahead of time.

WebText
•The GPT-2 authors created a new web scrape which emphasizes document quality
•They scraped web pages which have been curated/filtered by humans
•Manually filtering a full web scrape would be exceptionally expensive
•Instead, they scraped all outbound links from Reddit that received at least 3 karma (see the sketch below)
•Karma is a heuristic indicator of whether other users found the link interesting, educational, or just funny
Karma ≥ 3
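A minimal sketch of this kind of karma filter, not the authors' actual scraping pipeline; the `submissions` iterable and its `.url`/`.score` attributes are assumed placeholders (e.g. a Reddit data dump or an API client).

```python
# Hypothetical sketch of the Reddit karma filter described on this slide.
# `submissions` is assumed to be an iterable of posts with `.url` and `.score`
# attributes; this is not the authors' code.
def collect_outbound_links(submissions, min_karma=3):
    links = set()
    for post in submissions:
        # Keep only external links that at least `min_karma` users upvoted --
        # a cheap proxy for "other humans found this page worth reading".
        if post.score >= min_karma and not post.url.startswith("https://www.reddit.com"):
            links.add(post.url)
    return links
```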

WebText
•45 million links
•Used content extractors to extract the text from the HTML
•After de-duplication and some heuristic-based cleaning:
•slightly over 8 million documents
•40 GB of text
•Removed ALL Wikipedia documents
•since Wikipedia is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks

Input Representation

Byte Pair Encoding (BPE)
•Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)
•A practical middle ground between character-level and word-level language modeling
•Effectively interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences
•Combines the empirical benefits of word-level LMs with the generality of byte-level approaches (see the sketch below)
•This approach can assign a probability to any Unicode string, regardless of pre-processing, tokenization, or vocabulary size
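As a refresher, here is a minimal sketch of the BPE merge-learning loop from Sennrich et al. on a toy word-level corpus. GPT-2 actually applies BPE over raw UTF-8 bytes (with a 256-symbol base vocabulary and some extra merging restrictions), which this sketch omits.

```python
import re
from collections import Counter

def pair_stats(vocab):
    """Count adjacent symbol pairs over the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each entry is a space-separated symbol sequence (with an
# end-of-word marker) and its frequency.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                  # learn 10 merges
    pairs = pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)
```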

Byte Pair Encoding (BPE)
Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)

Model

Transformer
•Transformer-based
•Follows the details of GPT-1
•Layer Normalization was moved to the input of each sub-block, similar to pre-activation in ResNet (see the sketch below)
•An additional LayerNorm was added after the final self-attention block
•The vocabulary is expanded to 50,257 tokens
•A batch size of 512 is used
(Figure: the original Transformer architecture, shown for comparison)
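To illustrate the pre-activation-style change, here is a rough sketch of a GPT-2-style block in PyTorch. It uses torch.nn.MultiheadAttention instead of GPT-2's own attention code, the default sizes correspond to the smallest model, and the final LayerNorm after the last block is not shown; treat it as a sketch under those assumptions, not the released implementation.

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Sketch of a GPT-2-style Transformer block: LayerNorm is applied to the
    *input* of each sub-block (attention and MLP) rather than after it, and the
    residual connection skips around the normalized sub-block."""
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Pre-LN self-attention sub-block (a causal mask, not built here,
        # keeps the model autoregressive).
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        # Pre-LN feed-forward sub-block.
        x = x + self.mlp(self.ln2(x))
        return x
```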

Experiments

Model sizes
(Figure: parameter-count comparison of GPT-1, BERT-large, and GPT-2)

Zero-shot results

Children’s Book Test
•Hill, Felix et al. “The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.” (2016)
•Reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct
•The GPT-2 authors compute the probability of each choice (and of the rest of the sentence conditioned on that choice) under the LM, and predict the choice with the highest probability (see the sketch below)
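A sketch of this scoring rule using the Hugging Face port of GPT-2 ("gpt2" is the smallest checkpoint; the paper used OpenAI's own code and its largest model). The "XXXXX" placeholder convention here is illustrative, not the dataset's exact format.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text):
    """Total log-probability of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)     # out.loss = mean NLL per predicted token
    return -out.loss.item() * (ids.shape[1] - 1)

def predict_cloze(context_with_blank, candidates):
    """Fill the blank with each candidate and keep the most likely completion."""
    return max(candidates, key=lambda w: log_prob(context_with_blank.replace("XXXXX", w)))
```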

LAMBADA
•LAnguage Modeling Broadened to Account for Discourse Aspects
•Paperno, Denis et al. “The LAMBADA dataset: Word prediction requiring a broad
discourse context.” (2016)
•The task is to predict the final word of sentences which require at least 50 tokens of context for a human to predict successfully
•GPT-2 improves the state of the art from 99.8 to 8.63 perplexity (the metric is sketched below)
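For reference, perplexity is just the exponentiated average negative log-likelihood per token. A minimal sketch of the metric, for logits from any autoregressive LM (not tied to the paper's evaluation code):

```python
import torch

def perplexity(logits, targets):
    """logits: (seq_len, vocab) next-token scores; targets: (seq_len,) gold token ids.
    Returns exp(mean negative log-likelihood), i.e. the perplexity."""
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(targets.shape[0]), targets]
    return torch.exp(nll.mean()).item()
```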

Winograd Schema Challenge
•Tests commonsense reasoning by measuring the model's ability to resolve ambiguities in text

Winograd Schema Challenge
Trinh, Trieu H. and Quoc V. Le. “A Simple Method for Commonsense Reasoning.” (2018)

Summarization
•Added the text “TL;DR:” after the article and generated 100 tokens with top-k random sampling, k = 2 (see the sketch below)
•Evaluated on the CNN and Daily Mail dataset
•Used the first 3 sentences among these 100 generated tokens as the summary
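A sketch of this setup with the Hugging Face port of GPT-2, not the paper's own code; the checkpoint name and the naive sentence splitting are illustrative choices.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def tldr_summary(article: str) -> str:
    # Induce summarization by appending "TL;DR:" and letting the LM continue.
    ids = tokenizer(article + "\nTL;DR:", return_tensors="pt").input_ids
    out = model.generate(
        ids,
        do_sample=True,          # top-k *random* sampling rather than greedy decoding
        top_k=2,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    # Keep the first 3 generated sentences as the summary (naive split on ". ").
    return ". ".join(generated.split(". ")[:3])
```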

Translation
•Uses an ‘english sentence = french sentence’ prompt format (sketched below)
•Generates text after ‘english sentence = ’
•Samples from the model with greedy decoding and uses the first generated sentence as the translation
•GPT-2 gets 5 BLEU on the WMT-14 English-French test set
•GPT-2 gets 11.5 BLEU on the WMT-14 French-English test set
•This outperforms several unsupervised machine translation baselines (2017)
•But it is still much worse than the 33.5 BLEU of the current SOTA in unsupervised machine translation (2019)
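A sketch of the prompt construction only; the example pairs below are made-up placeholders, not the conditioning pairs used in the paper.

```python
def translation_prompt(example_pairs, source_sentence):
    """Build the 'english sentence = french sentence' few-shot context, then
    leave the prompt hanging after 'source_sentence =' for the LM to complete."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

prompt = translation_prompt(
    [("I like apples.", "J'aime les pommes."),                     # placeholder examples
     ("The weather is nice today.", "Il fait beau aujourd'hui.")],
    "Where is the train station?",
)
# Feed `prompt` to the LM with greedy decoding and take the first generated
# sentence as the translation.
```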

Translation
•A surprising result!
•The GPT-2 authors deliberately removed non-English webpages from WebText as a filtering step
•They ran a byte-level language detector on WebText
•It found only 10MB of data in the French language
•(approximately 500x smaller than the monolingual French corpora common in prior unsupervised machine translation research)

Question Answering
•GPT-2 answers 4.1% of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD (the metric is sketched below)
•The smallest model does not exceed the 1.0% accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc.)
•→ Model capacity is important
•But GPT-2 reaches 63.1% accuracy on the 1% of questions it is most confident in
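The exact match metric itself is simple; here is a sketch in the spirit of the SQuAD evaluation script (the normalization steps shown are the usual ones, not copied from the paper).

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """A prediction counts as correct if it equals any reference answer after normalization."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)
```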

Generalization vs Memorization
•It is important to analyze how much test data also shows up in the training data
•Using Bloom filters, the authors measured what percentage of each test set is also found in the WebText training set (see the sketch below)
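A minimal sketch of that overlap analysis: insert all training-set n-grams into a Bloom filter, then check what fraction of a test document's n-grams the filter (probably) contains. The filter size, hash scheme, and n = 8 default here are illustrative choices, not the paper's exact settings.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per item, one bit per position."""
    def __init__(self, size=10_000_000, n_hashes=8):
        self.size, self.n_hashes, self.bits = size, n_hashes, bytearray(size)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def ngrams(text, n=8):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def overlap(train_filter, test_text, n=8):
    """Fraction of the test document's n-grams that (probably) occur in training data."""
    grams = ngrams(test_text, n)
    return sum(g in train_filter for g in grams) / max(len(grams), 1)
```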

WebText Underfitting

Conclusions
•Unsupervised task learning is an additional promising area of research to explore
•The performance of GPT-2 is competitive with supervised baselines in a zero-shot setting
•on reading comprehension
•but not on other tasks like summarization, etc.
•The paper studied the zero-shot performance of WebText LMs on many canonical NLP tasks

Discussions

Personal Thoughts
•Rather than focusing on a novel model architecture, the paper focuses on unsupervised task learning, evaluating and analyzing on various canonical datasets and tasks
•Compared to the hype, the model's actual results are relatively modest
•Scaling is important; recent research at large companies has already transitioned to huge models
•Zero-shot learning is interesting

What do you think about OpenAI not releasing the model?
(Is it ethical for OpenAI to keep the big model private?)
•Arguments against withholding the model:
•it propagates fear
•it creates reproducibility issues
•it generates unnecessary hype
•Arguments for withholding it: the model may be used maliciously, for example to
•generate misleading news articles
•automate the production of abusive or faked content to post on social media
•automate the production of spam/phishing content

Thank you!