Training language models to follow instructions with human feedback (InstructGPT)
Rama Irsheidat
About This Presentation
Training language models to follow instructions with human feedback (InstructGPT).pptx
Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI)
Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.
Size: 8.95 MB
Language: en
Added: Jun 16, 2023
Slides: 56 pages
Slide Content
Training language models to follow instructions with human feedback (InstructGPT). Long Ouyang, Jeff Wu, Xu Jiang et al. (OpenAI). Presented by Rama Irsheidat.
TABLE OF CONTENTS: 01 Introduction, 02 Objectives, 03 Methodology, 04 Future work, 05 Conclusion
01 Introduction
Model owners (team of authors): Long Ouyang* (Research Scientist at OpenAI), Jeffrey Wu* (Research engineer on OpenAI's safety team), and the OpenAI team* (an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity). *All pictures and information about the authors and the company are from LinkedIn.
Large language models (LMs) can perform a range of natural language processing (NLP) tasks when prompted with examples. However, these models often exhibit unintended behaviors, such as generating biased or toxic text, making up facts, or not following user instructions. The authors aim to align LMs by training them to act in accordance with the user's intention, both explicit and implicit, and they evaluate the models on being helpful, honest, and harmless. Their fine-tuning approach aligns language models to follow a broad class of written instructions using reinforcement learning from human feedback (RLHF).
Pre-training task. The paper focuses on fine-tuning the pre-trained GPT-3 language model with human feedback. GPT-3 is pre-trained on a large corpus of text data using a language modeling objective, which involves predicting the next token in a sequence of text.
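As a rough illustration of that objective (not code from the paper), here is a minimal PyTorch sketch of next-token cross-entropy; `model`, the batch of `token_ids`, and all shapes are hypothetical placeholders.

```python
import torch.nn.functional as F

def language_modeling_loss(model, token_ids):
    """Next-token prediction (causal LM) loss.

    token_ids: LongTensor of shape (batch, seq_len).
    model(token_ids) is assumed to return logits of shape
    (batch, seq_len, vocab_size).
    """
    logits = model(token_ids)             # (B, T, V)
    pred = logits[:, :-1, :]              # predictions for positions 1..T-1
    target = token_ids[:, 1:]             # the tokens that actually follow
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),  # (B*(T-1), V)
        target.reshape(-1),               # (B*(T-1),)
    )
```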
Three steps for building InstructGPT
Supervised fine-tuning (SFT) model. As a first step, they hired a team of 40 contractors based on their screening test results. Then, they trained their supervised learning baselines on human-written demonstrations of the desired output behavior, using (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts (the source of the training data).
Three steps for building InstructGPT
Reward model (RM) training. Secondly, they gather a dataset of human-labeled comparisons between outputs from OpenAI's models on a larger set of API prompts. They then train a reward model (RM) on this dataset to predict which model output their labelers would prefer.
Three steps for building InstructGPT
Optimizing a policy against the reward model using reinforcement learning (RL). Finally, they use this RM as a reward function and fine-tune the supervised learning baseline to maximize this reward using the PPO algorithm.
Evaluation. They primarily evaluate their models by having their labelers rate the quality of model outputs on a test set consisting of prompts from held-out customers who were not included in the training data. Additionally, they perform automatic evaluations on various public NLP datasets.
Human evaluations on the OpenAI API prompt distribution. SFT: the supervised fine-tuning model. PPO: proximal policy optimization, the reinforcement learning algorithm used in this paper to fine-tune the language model against the reward model. PPO-ptx: a variant of PPO that mixes pretraining gradients into the PPO updates to reduce performance regressions on public NLP datasets; the paper's InstructGPT models refer to the PPO-ptx models. GPT: Generative Pre-trained Transformer, the base model used for natural language processing (NLP) tasks. GPT (prompted): GPT-3 given a few-shot prompt designed to make it better at following instructions. (GPT, GPT prompted): the GPT-3 baselines. They found that outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3 model, despite having 100x fewer parameters.
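For reference, the combined RL objective that distinguishes PPO-ptx from plain PPO can be written as below (reconstructed from the paper's description: π_φ^RL is the learned policy, π^SFT the supervised baseline, r_θ the reward model, β the KL coefficient, and γ the pretraining-loss coefficient; γ = 0 recovers plain PPO).

```latex
\operatorname{objective}(\phi)
  = \mathbb{E}_{(x,y)\sim D_{\pi_{\phi}^{\mathrm{RL}}}}
      \Big[\, r_{\theta}(x,y)
        - \beta \log\!\frac{\pi_{\phi}^{\mathrm{RL}}(y\mid x)}{\pi^{\mathrm{SFT}}(y\mid x)} \,\Big]
  + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}
      \big[\log \pi_{\phi}^{\mathrm{RL}}(x)\big]
```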
Variants of the model in terms of size. They experiment with different sizes of GPT-3 language models (1.3B, 6B, and 175B parameters). They also train InstructGPT models at these sizes and, as shown on the previous slide, compare the 1.3B-parameter InstructGPT model to the original 175B GPT-3 model.
InstructGPT (GPT-3) model architecture. Since InstructGPT is fine-tuned from GPT-3, it shares GPT-3's decoder-only Transformer architecture.
Comparing the performance of GPT-3 and InstructGPT models on various tasks. InstructGPT: InstructGPT models are better at generating appropriate outputs, following explicit constraints in the instructions, and giving truthful and informative answers; they make up information not present in the input about half as often as GPT-3 models; they show small improvements in toxicity; performance regressions on public NLP datasets are minimized; they generalize to the preferences of "held-out" labelers; and they show promising generalization to instructions outside of the RLHF fine-tuning distribution. Vs. GPT-3: GPT-3 models do not perform as well at generating appropriate outputs, even when given a few-shot prompt designed to make them better at following instructions; improvements in bias are small for both models; GPT-3 is the baseline against which performance regressions on public NLP datasets are measured; GPT-3 models can perform many tasks but require more careful prompting and do not usually follow instructions.
02 Objectives
Objective 1: Creating a language model, called InstructGPT, that can follow a broad class of written instructions helpfully and safely, while avoiding untruthful, toxic, or otherwise harmful outputs. Objective 2: Showing that fine-tuning language models with human feedback is a promising approach for aligning them with human intent. Objective 3: Showing that a 1.3-billion-parameter InstructGPT model can be preferred over the 175-billion-parameter GPT-3 model despite having far fewer parameters.
03 Methodology
1 Related work, 2 Methods and experimental details, 3 Results, 4 Discussion, 5 Additional details
Related work: research on alignment and learning from human feedback; training language models to follow instructions; evaluating the harms of language models; modifying the behavior of language models to mitigate harms.
High-level methodology: (1) collect demonstration data and train a supervised policy; (2) collect comparison data and train a reward model; (3) optimize a policy against the reward model using PPO. Methods and experimental details
Dataset. To train the first InstructGPT models, labelers needed to write prompts themselves, since an initial source of instruction-like prompts was required to bootstrap the process. Three kinds of labeler-written prompts are used: Plain (an arbitrary task), Few-shot (an instruction with multiple query/response pairs), and Use case-based (prompts corresponding to use cases stated in waitlist applications for the OpenAI API). Methods and experimental details
Data cleaning They heuristically de-duplicate prompts by checking for prompts that share a long common prefix, and they limit the number of prompts to 200 per user ID. They also create their train, validation, and test splits based on user ID, so that the validation and test sets contain no data from users whose data is in the training set. Methods and experimental details
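A minimal sketch of such a cleaning and splitting procedure, assuming prompts arrive as (user_id, text) pairs; the prefix length, the 80/10/10 split, and the function name are illustrative assumptions, not details given in the paper.

```python
import random
from collections import defaultdict

def clean_and_split(prompts, prefix_len=64, max_per_user=200, seed=0):
    """prompts: list of (user_id, text) pairs.

    - Heuristically de-duplicate by a shared long prefix.
    - Keep at most `max_per_user` prompts per user ID.
    - Split train/validation/test by user ID, so no user's data
      appears in more than one split.
    """
    seen_prefixes = set()
    per_user = defaultdict(list)
    for user_id, text in prompts:
        prefix = text[:prefix_len]
        if prefix in seen_prefixes:
            continue                      # drop near-duplicate prompts
        seen_prefixes.add(prefix)
        if len(per_user[user_id]) < max_per_user:
            per_user[user_id].append(text)

    users = sorted(per_user)
    random.Random(seed).shuffle(users)
    n = len(users)
    train_users = set(users[: int(0.8 * n)])
    val_users = set(users[int(0.8 * n): int(0.9 * n)])

    splits = {"train": [], "validation": [], "test": []}
    for user_id, texts in per_user.items():
        if user_id in train_users:
            splits["train"].extend(texts)
        elif user_id in val_users:
            splits["validation"].extend(texts)
        else:
            splits["test"].extend(texts)
    return splits
```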
From these prompts, they produce three different datasets used in their fine-tuning procedure: the SFT dataset of labeler demonstrations used to train the SFT models (about 13k training prompts), the RM dataset of labeler rankings used to train the reward models (about 33k training prompts), and the PPO dataset, without any human labels, used as inputs for RLHF fine-tuning (about 31k training prompts). Methods and experimental details
Tasks. Their training tasks come from two sources: (1) a dataset of prompts written by their labelers and (2) a dataset of prompts submitted to early InstructGPT models on their API. These prompts are very diverse and include generation, question answering, dialog, summarization, extraction, and other natural language tasks. Methods and experimental details
Human data collection. Through a screening test, the aim was to select a group of labelers who were sensitive to the preferences of different demographic groups and able to identify potentially harmful outputs. During training and evaluation, the alignment criteria may come into conflict: during training they prioritize helpfulness to the user, while during evaluation they prioritize truthfulness and harmlessness. Methods and experimental details
Human data collection. To test whether the model generalizes to the preferences of other labelers, a separate set of labelers ("held-out labelers") is hired who do not produce any training data. Despite the complexity of the task, inter-annotator agreement rates are quite high: training labelers agree with each other 72.6 ± 1.5% of the time, and for held-out labelers the number is 77.3 ± 1.3%. Methods and experimental details
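To make the agreement numbers above concrete, here is a hypothetical helper that computes the pairwise agreement rate for a single comparison task; the paper's exact aggregation over tasks and labelers is not reproduced here, so this is only a sketch.

```python
from itertools import combinations

def pairwise_agreement(choices):
    """choices: dict mapping labeler -> index of the output that labeler
    preferred for one comparison task.

    Returns the fraction of labeler pairs that picked the same output.
    An overall figure like the 72.6% above would be this value averaged
    over many comparison tasks.
    """
    pairs = list(combinations(choices.values(), 2))
    if not pairs:
        return 1.0
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs)
```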
Models. Supervised fine-tuning (SFT): fine-tune GPT-3 on labeler demonstrations using supervised learning; train for 16 epochs using cosine learning rate decay and residual dropout of 0.2; the SFT models overfit on validation loss after 1 epoch. Reward modeling (RM): starting from the SFT model with the final unembedding layer removed, train a model that takes in a prompt and response and outputs a scalar reward; present labelers with K = 4 to K = 9 responses to rank; train on all comparisons from each prompt as a single batch element. Reinforcement learning (RL): fine-tune the SFT model with PPO; given a prompt and response, the environment produces a reward determined by the reward model and ends the episode; add a per-token KL penalty from the SFT model at each token to mitigate over-optimization of the reward model (a sketch of this reward shaping follows below). Methods and experimental details
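A minimal sketch of the per-token KL-shaped reward described above, assuming per-token log-probabilities have already been computed for a sampled response; the function name and the value of beta are illustrative, not taken from the paper.

```python
import torch

def shaped_rewards(rm_score, logprobs_rl, logprobs_sft, beta=0.02):
    """Per-token rewards for one RL episode (prompt + sampled response).

    rm_score:     scalar tensor, reward-model score for the full response.
    logprobs_rl:  (T,) log-probs of the response tokens under the RL policy.
    logprobs_sft: (T,) log-probs of the same tokens under the frozen SFT model.

    Each token is penalized by beta * (log pi_RL - log pi_SFT); the
    reward-model score is added only at the final token, where the
    episode ends.
    """
    rewards = -beta * (logprobs_rl - logprobs_sft)   # (T,)
    rewards[-1] = rewards[-1] + rm_score
    return rewards

# Example with dummy values:
# r = shaped_rewards(torch.tensor(0.7),
#                    torch.tensor([-1.2, -0.8, -2.0]),
#                    torch.tensor([-1.0, -0.9, -1.5]))
```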
Loss function for the reward model: \mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}}\,\mathbb{E}_{(x, y_w, y_l)\sim D}\big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\big], where r_θ(x, y) is the scalar output of the reward model for prompt x and completion y with parameters θ, y_w is the preferred completion out of the pair y_w and y_l, and D is the dataset of human comparisons. Methods and experimental details
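A minimal sketch of this pairwise loss for a single comparison, assuming `reward_model(prompt, completion)` returns the scalar r_θ(x, y); batching all K-choose-2 comparisons from one prompt together, as the slide above describes, is omitted for brevity.

```python
import torch.nn.functional as F

def reward_model_pair_loss(reward_model, prompt, preferred, rejected):
    """-log sigmoid(r(x, y_w) - r(x, y_l)) for one human comparison."""
    r_w = reward_model(prompt, preferred)   # r_theta(x, y_w)
    r_l = reward_model(prompt, rejected)    # r_theta(x, y_l)
    # logsigmoid is the numerically stable form of log(sigmoid(.)).
    return -F.logsigmoid(r_w - r_l).mean()
```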
Evaluation. Evaluations on the API distribution: the main metric is human preference ratings on a held-out set of prompts from the same source as the training distribution; prompts are drawn equally from those submitted to GPT and to InstructGPT models so as not to bias the comparison toward either model. Evaluations on public NLP datasets: datasets that measure language model safety, truthfulness, toxicity, and bias; they also evaluate on two types of public datasets, FLAN and T0, both of which consist of a variety of NLP tasks, and conduct human evaluations of toxicity on the RealToxicityPrompts dataset. Methods and experimental details
API distribution: labelers significantly prefer InstructGPT outputs over outputs from GPT-3, and InstructGPT generalizes to the preferences of "held-out" labelers. Public NLP datasets: InstructGPT shows improvements in truthfulness over GPT-3, shows small improvements in toxicity over GPT-3 but not in bias, and minimizes performance regressions on public NLP datasets by modifying the RLHF fine-tuning procedure. Results
Qualitative results. InstructGPT shows promising generalization to instructions outside the fine-tuning distribution, such as non-English instructions and questions about code, although they notice that it often produces an output in English even when the instruction is in another language. In comparison, they find that GPT-3 can perform these tasks but requires more careful prompting and rarely follows instructions in these domains. Results
Preference results. Results. Left: results on prompts submitted to GPT models on the API. Right: results on prompts submitted to InstructGPT models on the API. Top: results from held-out labelers. Bottom: results from training labelers.
Metadata results on the API distribution. Results. Compared to GPT-3, the PPO models are more appropriate in the context of a customer assistant, are better at following explicit constraints in the instruction and attempting the correct instruction, and are less likely to make up information on closed-domain tasks.
Comparing with FLAN and T0 in terms of Likert scores. Results. On a 1-7 scale, on the InstructGPT prompt distribution. FLAN and T0 perform better than default GPT-3, and comparably with a few-shot GPT-3 model placed into 'instruction-following' mode.
Results on the TruthfulQA dataset. Results. Gray bars indicate ratings of truthfulness. Colored bars indicate ratings of truthfulness and informativeness.
Result comparisons on mainstream NLP tasks. Results. They compare the performance of their InstructGPT models to the original GPT-3 model on several mainstream NLP tasks, including sentiment analysis, question answering, text classification, and other natural language tasks (see Table 1). They found that the InstructGPT models performed similarly to or slightly worse than the original GPT-3 model on these tasks, but with improvements in truthfulness and reductions in toxic output generation.
Implications for alignment research. The cost of increasing model alignment is modest relative to pretraining: RLHF fine-tuning of the 175B model requires about 60 petaflop/s-days, versus roughly 3,640 petaflop/s-days to pretrain GPT-3 175B (under 2% of the pretraining compute). There is evidence that InstructGPT generalizes "following instructions" to settings it was not directly supervised on. Most of the performance degradations introduced by fine-tuning can be mitigated. Alignment techniques from research are validated in the real world. Discussion
01 InstructGPT models' behavior is influenced by human feedback from contractors. 02 Labeling tasks may be impacted by contractors' beliefs, cultural backgrounds, and personal history. 03 Their team of contractors is not representative of the full spectrum of people who will use these models. 04 Labelers are primarily English-speaking, and their data consists almost entirely of English instructions. Limitations Discussion
05 Their models are not fully aligned or fully safe: they can generate toxic or biased outputs, make up facts, and generate sexual and violent content without explicit prompting (examples of these mistakes appear on the next slide). 06 Their models follow the user's instructions, even if doing so could lead to harm in the real world. Limitations Discussion
Model mistakes: the model can be confused by instructions that assume false premises, and it can overly hedge rather than directly answering simple questions.
Open Questions. OQ 01: Several methods could be tried to further decrease the models' propensity to generate toxic, biased, or otherwise harmful outputs. OQ 02: Training models to be harmless despite user instructions is important, but difficult, since whether an output is harmful depends on the context in which it is deployed. OQ 03: There is a vast space of options for designing interfaces for labelers to provide feedback to language models; this is an interesting human-computer interaction problem. OQ 04: How to design an alignment process that is transparent. Discussion
Additional details: additional prompt data details, additional human data collection details, additional model details, automatic evaluation details, additional results, and model samples.
Data diversity Additional prompt data details A subset of their labeled prompt metadata. Note that their annotation fields changed over the course of the project, so not every prompt was annotated for every field.
Web interface. Additional human data collection details. For each output, labelers give a Likert score for overall quality on a 1-7 scale and also provide various metadata labels.
Web interface. Additional human data collection details. Labelers rank all the outputs for a given prompt. Ties are encouraged in cases where two outputs seem to be of similar quality.
Labeler demographic data Additional human data collection details
Additional results Performance on public NLP datasets Zero-shot performance of their models on various public NLP datasets
Performance on public NLP datasets Additional results Few-shot performance of their models on various public NLP datasets
04 Future work
Developing better ways to detect and remove biased or harmful content; incorporating ethical considerations into the design of their models; implementing safeguards to prevent the generation of harmful outputs.
05 Conclusion
The paper concludes that fine-tuning language models with human feedback is a promising direction for aligning these models with user intent . The authors demonstrate this approach using the GPT-3 language model and show that their method, called InstructGPT, can improve the truthfulness and reduce the toxicity of model outputs while maintaining performance on public NLP datasets. They also found that the 1.3B parameter InstructGPT model is preferred to the 175B GPT-3 model in human evaluations, despite having fewer parameters . The authors suggest that their method could be applied to a wide range of NLP tasks and could help address concerns about the ethical implications of large language models.