GPT-2: Language Models are Unsupervised Multitask Learners

YoungSeokKim8 · 31 slides · Sep 27, 2019

About This Presentation

Review of paper
Language Models are Unsupervised Multitask Learners
(GPT-2)
by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in...


Slide Content

Language Models are Unsupervised Multitask Learners
(GPT-2)
OpenAI
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.03.03
Presented by Young Seok Kim
PR-145

Articles & Useful Links
•Official
•Technical Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf
•Blog: https://blog.openai.com/better-language-models/
•GitHub: https://github.com/openai/gpt-2
•Unofficial
•Reddit: https://www.reddit.com/r/MachineLearning/comments/aqlzde/r_openai_better_language_models_and_their/

Related Papers
•Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)
•PR-049: https://youtu.be/6zGgVIlStXs
•Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html
•Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)
•Website: https://blog.openai.com/language-unsupervised/
•Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
•Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding.” (2018)
•Website: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
•Paper: https://arxiv.org/abs/1810.04805
•PR-121: https://youtu.be/GK4IO3qOnLc

Dataset

Dataset (BERT)
BookCorpus (800M words) + Wikipedia (2,500M words)

Common Crawl?
•Common Crawl has significant data quality issues.
•Prior work achieved the best results by using a small subsample of Common Crawl which included only documents most similar to the target dataset.
•The authors of GPT-2 wanted to avoid making assumptions about the tasks to be performed ahead of time.

WebText
•The GPT-2 authors created a new web scrape which emphasizes document quality
•They scraped web pages which have been curated/filtered by humans
•Manually filtering a full web scrape would be exceptionally expensive
•Instead, they scraped all outbound links from Reddit that received at least 3 karma (see the sketch below)
•Karma is a heuristic indicator of whether other users found the link interesting, educational, or just funny
Karma ≥ 3
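A minimal sketch of this kind of karma filter, not the authors' actual scraping pipeline; the `submissions` iterable and its `.url`/`.score` attributes are assumed placeholders (e.g. a Reddit data dump or an API client).

```python
# Hypothetical sketch of the Reddit karma filter described on this slide.
# `submissions` is assumed to be an iterable of posts with `.url` and `.score`
# attributes; this is not the authors' code.
def collect_outbound_links(submissions, min_karma=3):
    links = set()
    for post in submissions:
        # Keep only external links that at least `min_karma` users upvoted --
        # a cheap proxy for "other humans found this page worth reading".
        if post.score >= min_karma and not post.url.startswith("https://www.reddit.com"):
            links.add(post.url)
    return links
```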

WebText
•45 million links
•Used content extractors to extract the text from the HTML
•After de-duplication and some heuristic-based cleaning:
•slightly over 8 million documents
•40 GB of text
•Removed ALL Wikipedia documents
•since Wikipedia is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks

Input Representation

Byte Pair Encoding (BPE)
•Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)
•A practical middle ground between character-level and word-level language modeling
•Effectively interpolates between word-level inputs for frequent symbol sequences and character-level inputs for infrequent symbol sequences
•Combines the empirical benefits of word-level LMs with the generality of byte-level approaches (see the sketch below)
•This approach can assign a probability to any Unicode string, regardless of pre-processing, tokenization, or vocabulary size
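As a refresher, here is a minimal sketch of the BPE merge-learning loop from Sennrich et al. on a toy word-level corpus. GPT-2 actually applies BPE over raw UTF-8 bytes (with a 256-symbol base vocabulary and some extra merging restrictions), which this sketch omits.

```python
import re
from collections import Counter

def pair_stats(vocab):
    """Count adjacent symbol pairs over the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the chosen pair into a single new symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Toy corpus: each entry is a space-separated symbol sequence (with an
# end-of-word marker) and its frequency.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                  # learn 10 merges
    pairs = pair_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print(best)
```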

Byte Pair Encoding (BPE)
Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)

Model

Transformer
•Transformer-based
•Follows the details of GPT-1
•Layer Normalization was moved to the input of each sub-block, similar to pre-activation in ResNet (see the sketch below)
•An additional LayerNorm was added after the final self-attention block
•The vocabulary is expanded to 50,257 tokens
•A batch size of 512 is used
(Figure: the original Transformer architecture, shown for comparison)
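To illustrate the pre-activation-style change, here is a rough sketch of a GPT-2-style block in PyTorch. It uses torch.nn.MultiheadAttention instead of GPT-2's own attention code, the default sizes correspond to the smallest model, and the final LayerNorm after the last block is not shown; treat it as a sketch under those assumptions, not the released implementation.

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Sketch of a GPT-2-style Transformer block: LayerNorm is applied to the
    *input* of each sub-block (attention and MLP) rather than after it, and the
    residual connection skips around the normalized sub-block."""
    def __init__(self, d_model=768, n_head=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x, attn_mask=None):
        # Pre-LN self-attention sub-block (a causal mask, not built here,
        # keeps the model autoregressive).
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        # Pre-LN feed-forward sub-block.
        x = x + self.mlp(self.ln2(x))
        return x
```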

Experiments

Model sizes
(Figure: parameter-count comparison of GPT-1, BERT-large, and GPT-2)

Zero-shot results

Children’s Book Test
•Hill, Felix et al. “The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.” (2016)
•Reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct
•The GPT-2 authors compute the probability of each choice (and of the rest of the sentence conditioned on that choice) under the LM, and predict the choice with the highest probability (see the sketch below)
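A sketch of this scoring rule using the Hugging Face port of GPT-2 ("gpt2" is the smallest checkpoint; the paper used OpenAI's own code and its largest model). The "XXXXX" placeholder convention here is illustrative, not the dataset's exact format.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def log_prob(text):
    """Total log-probability of `text` under the LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)     # out.loss = mean NLL per predicted token
    return -out.loss.item() * (ids.shape[1] - 1)

def predict_cloze(context_with_blank, candidates):
    """Fill the blank with each candidate and keep the most likely completion."""
    return max(candidates, key=lambda w: log_prob(context_with_blank.replace("XXXXX", w)))
```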

LAMBADA
•LAnguage Modeling Broadened to Account for Discourse Aspects
•Paperno, Denis et al. “The LAMBADA dataset: Word prediction requiring a broad
discourse context.” (2016)
•The task is to predict the final word of sentences which require at least 50 tokens of context for a human to predict successfully
•GPT-2 improves the state of the art from 99.8 to 8.63 perplexity (the metric is sketched below)
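For reference, perplexity is just the exponentiated average negative log-likelihood per token. A minimal sketch of the metric, for logits from any autoregressive LM (not tied to the paper's evaluation code):

```python
import torch

def perplexity(logits, targets):
    """logits: (seq_len, vocab) next-token scores; targets: (seq_len,) gold token ids.
    Returns exp(mean negative log-likelihood), i.e. the perplexity."""
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs[torch.arange(targets.shape[0]), targets]
    return torch.exp(nll.mean()).item()
```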

Winograd Schema Challenge
•Tests commonsense reasoning by measuring the model's ability to resolve ambiguities in text

Winograd Schema Challenge
Trinh, Trieu H. and Quoc V. Le. “A Simple Method for Commonsense Reasoning.” (2018)

Summarization
•Added the text “TL;DR:” after the article and generated 100 tokens with top-k random sampling, k = 2 (see the sketch below)
•Evaluated on the CNN and Daily Mail dataset
•Used the first 3 sentences among these 100 generated tokens as the summary
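A sketch of this setup with the Hugging Face port of GPT-2, not the paper's own code; the checkpoint name and the naive sentence splitting are illustrative choices.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def tldr_summary(article: str) -> str:
    # Induce summarization by appending "TL;DR:" and letting the LM continue.
    ids = tokenizer(article + "\nTL;DR:", return_tensors="pt").input_ids
    out = model.generate(
        ids,
        do_sample=True,          # top-k *random* sampling rather than greedy decoding
        top_k=2,
        max_new_tokens=100,
        pad_token_id=tokenizer.eos_token_id,
    )
    generated = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    # Keep the first 3 generated sentences as the summary (naive split on ". ").
    return ". ".join(generated.split(". ")[:3])
```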

Translation
•Uses an ‘english sentence = french sentence’ prompt format (sketched below)
•Generates text after ‘english sentence = ’
•Samples from the model with greedy decoding and uses the first generated sentence as the translation
•GPT-2 gets 5 BLEU on the WMT-14 English-French test set
•GPT-2 gets 11.5 BLEU on the WMT-14 French-English test set
•This outperforms several unsupervised machine translation baselines (2017)
•But it is still much worse than the 33.5 BLEU of the current SOTA in unsupervised machine translation (2019)
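A sketch of the prompt construction only; the example pairs below are made-up placeholders, not the conditioning pairs used in the paper.

```python
def translation_prompt(example_pairs, source_sentence):
    """Build the 'english sentence = french sentence' few-shot context, then
    leave the prompt hanging after 'source_sentence =' for the LM to complete."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)

prompt = translation_prompt(
    [("I like apples.", "J'aime les pommes."),                     # placeholder examples
     ("The weather is nice today.", "Il fait beau aujourd'hui.")],
    "Where is the train station?",
)
# Feed `prompt` to the LM with greedy decoding and take the first generated
# sentence as the translation.
```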

Translation
•A surprising result!
•The GPT-2 authors deliberately removed non-English webpages from WebText as a filtering step
•They ran a byte-level language detector on WebText
•It found only 10MB of data in the French language
•(approximately 500x smaller than the monolingual French corpora common in prior unsupervised machine translation research)

Question Answering
•GPT-2 answers 4.1% of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD (the metric is sketched below)
•The smallest model does not exceed the 1.0% accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc.)
•→ Model capacity is important
•But GPT-2 reaches 63.1% accuracy on the 1% of questions it is most confident in
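The exact match metric itself is simple; here is a sketch in the spirit of the SQuAD evaluation script (the normalization steps shown are the usual ones, not copied from the paper).

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """A prediction counts as correct if it equals any reference answer after normalization."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)
```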

Generalization vs Memorization
•It is important to analyze how much test data also shows up in the training data
•Using Bloom filters, the authors measured what percentage of each test set is also found in the WebText training set (see the sketch below)
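A minimal sketch of that overlap analysis: insert all training-set n-grams into a Bloom filter, then check what fraction of a test document's n-grams the filter (probably) contains. The filter size, hash scheme, and n = 8 default here are illustrative choices, not the paper's exact settings.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: k hash positions per item, one bit per position."""
    def __init__(self, size=10_000_000, n_hashes=8):
        self.size, self.n_hashes, self.bits = size, n_hashes, bytearray(size)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

def ngrams(text, n=8):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def overlap(train_filter, test_text, n=8):
    """Fraction of the test document's n-grams that (probably) occur in training data."""
    grams = ngrams(test_text, n)
    return sum(g in train_filter for g in grams) / max(len(grams), 1)
```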

WebText Underfitting

Conclusions
•Unsupervised task learning is an additional promising area of research to explore
•The performance of GPT-2 is competitive with supervised baselines in a zero-shot setting
•on reading comprehension
•but not on other tasks like summarization, etc.
•The paper studied the zero-shot performance of WebText LMs on many canonical NLP tasks

Discussions

Personal Thoughts
•Rather than focusing on a novel model architecture, the paper focuses on unsupervised task learning, evaluating and analyzing on various canonical datasets and tasks
•Compared to the hype, the model's actual results are relatively modest
•Scaling is important; recent research at large companies has already transitioned to huge models
•Zero-shot learning is interesting

What do you think about OpenAI not releasing the model?
(Is it ethical for OpenAI to keep the big model private?)
•Arguments against withholding the model:
•it propagates fear
•it creates reproducibility issues
•it generates unnecessary hype
•Arguments for withholding it: the model may be used maliciously, for example to
•generate misleading news articles
•automate the production of abusive or faked content to post on social media
•automate the production of spam/phishing content

Thank you!