LLM GPT-3: Language models are few-shot learners

YanXu646657 · 25 slides · May 17, 2024

About This Presentation

Introduces the GPT-3 paper: Language models are few-shot learners.


Slide Content

GPT-3: Language models are few-shot learners (LLM Reading Group)

GPT-3 Applications: ChatBot

GPT-3 Applications: Summarization

GPT-3 Applications: Building Apps Demo
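
The application demos above sit on top of the OpenAI API. As an illustration only (not the deck's original demo code), here is a minimal summarization call, assuming the openai Python client (v1+) and an API key in the environment; the model name, prompts, and article text are placeholders.

```python
# Minimal sketch of calling the OpenAI API for summarization.
# Assumes the `openai` Python client (v1+) and OPENAI_API_KEY set in the
# environment; model name, prompts, and article text are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

article_text = "GPT-3 is a 175-billion-parameter autoregressive language model ..."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[
        {"role": "system", "content": "Summarize the user's text in one sentence."},
        {"role": "user", "content": article_text},
    ],
)
print(response.choices[0].message.content)
```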

From GPT to GPT-4
06/2017: "Attention Is All You Need" (Transformer architecture)
06/2018: GPT (pre-train and fine-tune)
02/2019: GPT-2 (zero-shot)
05/2020: GPT-3 (in-context few-shot)
03/2022: Training language models to follow instructions with human feedback (InstructGPT / GPT-3.5), over 350B parameters (human alignment)
11/2022: ChatGPT release
03/2023: GPT-4, a large-scale multimodal model with better post-training alignment, over 1.5T parameters (multi-modal)

GPT: Predicting the next token
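
A minimal sketch of what "predicting the next token" means in practice, using the openly available GPT-2 as a stand-in for GPT-3 (same autoregressive objective); the prompt is illustrative.

```python
# Greedy next-token prediction with a small GPT-2 model (stand-in for
# GPT-3, which is not openly available). The prompt is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Attention Is All You Need introduced the"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits        # (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]           # scores for the token after the prompt
next_token_id = int(next_token_logits.argmax())
print(tokenizer.decode([next_token_id]))    # greedy next-token prediction
```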

GPT-3 Model Architecture (GPT → GPT-2 → GPT-3)
GPT-2 changes over GPT: layer normalization was moved to the input of each sub-block, with an additional layer normalization added after the final self-attention block; the weights of residual layers are scaled at initialization by a factor of 1/√N, where N is the number of residual layers; the vocabulary is expanded to 50,257; the context size is increased from 512 to 1024 tokens; a larger batch size of 512 is used.
GPT-3 changes over GPT-2: alternating dense and locally banded sparse attention patterns, similar to the Sparse Transformer.
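
A rough PyTorch sketch of the GPT-2-style block changes listed above (layer normalization at the sub-block input and residual-projection weights scaled by 1/√N at initialization). The module sizes and the use of nn.MultiheadAttention are simplifications, and the GPT-3 sparse attention pattern is not reproduced here.

```python
# Sketch of a pre-LayerNorm transformer block with 1/sqrt(N) residual
# weight scaling at initialization. Simplified; not the original code.
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)   # LayerNorm moved to the sub-block input
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Scale residual-path output projections by 1/sqrt(N) at init,
        # where N is the total number of residual layers in the model.
        with torch.no_grad():
            self.attn.out_proj.weight.mul_(1.0 / math.sqrt(n_layers))
            self.mlp[-1].weight.mul_(1.0 / math.sqrt(n_layers))

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                          # residual connection (attention)
        x = x + self.mlp(self.ln2(x))      # residual connection (MLP)
        return x
```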

GPT-3: Increasing model size. Model performance is compared across different NLP tasks as model size increases.

In-context Learning
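
In-context learning means the task is specified entirely through the prompt (a natural-language description plus zero, one, or a few demonstrations), with no gradient updates. A small sketch of building a few-shot prompt; the translation pairs are illustrative, after the paper's English-to-French example.

```python
# Illustrative few-shot prompt: the task is conveyed purely through the
# context window (description plus demonstrations); no weights are updated.
examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe en peluche"),
]
query = "cheese"

prompt = "Translate English to French:\n"
for en, fr in examples:
    prompt += f"{en} => {fr}\n"
prompt += f"{query} =>"

print(prompt)  # the language model is expected to complete "fromage"
```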

Training: Datasets used to train GPT-3

Evaluation: For few-shot learning, each example in the evaluation set is evaluated by randomly drawing K examples from that task's training set as conditioning (in-context examples), delimited by 1 or 2 newlines depending on the task. K can range from 0 up to the maximum number allowed by the model's context window (n_ctx = 2048 tokens for all models), which typically fits 10 to 100 examples. Larger values of K are usually, but not always, better. On tasks with free-form completion, beam search is used with a beam width of 4 and a length penalty of α = 0.6.
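
A sketch of the few-shot evaluation setup above: for each test example, K demonstrations are drawn at random from the task's training set and concatenated, newline-delimited, ahead of the test prompt. The helper names, field names, and Q/A formatting are hypothetical.

```python
# Sketch of few-shot prompt construction for evaluation.
# `train_set` is assumed to be a list of dicts with "question"/"answer"
# keys; the formatting and delimiter choice are hypothetical.
import random

def format_example(example, with_answer: bool) -> str:
    text = f"Q: {example['question']}\nA:"
    if with_answer:
        return text + f" {example['answer']}"
    return text

def build_conditioning_prompt(test_example, train_set, k: int, delimiter: str = "\n\n") -> str:
    demos = random.sample(train_set, k)  # K in-context examples
    parts = [format_example(d, with_answer=True) for d in demos]
    parts.append(format_example(test_example, with_answer=False))
    return delimiter.join(parts)
```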

Task Phrasing and Specifications

Task categories: commonsense reasoning, language understanding, natural language inference (entailment / contradiction / neutral), and translation from non-English into English.

PIQA: Physical Interaction: Question Answering

COPA: Choice Of Plausible Alternatives. SuperGLUE: Super General Language Understanding Evaluation.
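
For multiple-choice benchmarks such as PIQA and COPA, a common way to score a language model (roughly what the paper describes) is to compare the likelihood the model assigns to each candidate completion given the context, and pick the highest. A sketch with GPT-2 as a stand-in; the example question is invented, and the sketch assumes the context tokens form a prefix of the joint tokenization.

```python
# Score multiple-choice candidates by completion log-likelihood.
# GPT-2 is used as a stand-in model; the question is invented.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def completion_logprob(context: str, completion: str) -> float:
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                    # (1, L, vocab)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predict tokens 1..L-1
    targets = full_ids[0, 1:]
    idx = torch.arange(ctx_len - 1, targets.shape[0])      # completion token positions
    return float(log_probs[idx, targets[idx]].sum())

context = "To keep pizza from sticking to the pan,"
choices = [" dust the pan with flour.", " fill the pan with water."]
scores = [completion_logprob(context, c) for c in choices]
print(choices[scores.index(max(scores))])
```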

The GPT-3 models are biased and tend to reflect stereotypes present in their training data.

Task categories: commonsense reasoning, language understanding, natural language inference (entailment / contradiction / neutral), translation from non-English into English, and open-book QA.

From GPT to GPT-4 (recap of the timeline above): Transformer architecture → pre-train and fine-tune → zero-shot → in-context few-shot → human alignment → multi-modal. More Coming Up!