GPT-4: A Glimpse into GPT-4, Demystified
YanXu646657
205 views
38 slides
May 17, 2024
About This Presentation
Introduces GPT-4, a stronger model than GPT-3.5 with multi-modal capabilities
Size: 8.96 MB
Language: en
Added: May 17, 2024
Slides: 38 pages
Slide Content
Demystify GPT-4. Yan Xu, Houston Machine Learning LLM Reading Group, Feb 2, 2024
GPT-4: The Unknown
From Transformer to GPT-4:
06/2017: Attention Is All You Need (the Transformer architecture)
06/2018: Pre-train and fine-tune (GPT-1)
02/2019: Zero-shot (GPT-2)
05/2020: In-context few-shot (GPT-3)
03/2022: Human alignment: training language models to follow instructions with human feedback (GPT-3.5 / InstructGPT), over 350B parameters
11/2022: ChatGPT release
03/2023: GPT-4, a large-scale multimodal model with better post-training alignment, improved performance, and multi-modal input; est. 1.5T parameters
Demystify GPT-4
Demystify GPT-4
GPT-4: Mixture of Experts? Rumored to be a mixture of ~220B-parameter expert models, which are easier to train and maintain than a single monolithic dense model.
Mixture of Experts: Divide a task into subtasks (language, math, coding, science, etc.). Develop an expert for each subtask. Use a gating model to decide which expert to use. Pool the expert predictions and the gating model's output to make the final prediction.
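As a concrete illustration of the "gate over N experts, pool weighted outputs" scheme described above, here is a minimal mixture-of-experts layer sketch. It is not GPT-4's actual architecture (which is unpublished); the module and parameter names (SimpleMoE, n_experts, top_k) are illustrative assumptions.

```python
# Minimal mixture-of-experts sketch: a learned gate routes each token to a few
# experts and the weighted expert outputs are pooled. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        # One feed-forward "expert" per subtask; the gate learns which to use.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(d_model, n_experts)  # gating model
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = self.gate(x)                            # (B, T, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to top_k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```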
Predictable Scaling for efficient model tuning. A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–10,000× less compute.
Predictable Scaling for efficient model tuning
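To make the idea of predictable scaling concrete, the sketch below fits a power law with an irreducible-loss term to final losses from hypothetical small runs and extrapolates to a much larger compute budget. All numbers are made up for illustration; this is not OpenAI's actual methodology or data.

```python
# Hedged sketch of "predictable scaling": fit L(C) = a * (C/C0)^(-b) + c to
# losses from small training runs and extrapolate to the full-size run.
import numpy as np
from scipy.optimize import curve_fit

C0 = 1e18  # normalization constant for numerical stability

def scaling_law(compute, a, b, c):
    return a * (compute / C0) ** (-b) + c

# Hypothetical (compute, final loss) pairs from runs 1,000x-10,000x smaller.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss = np.array([3.10, 2.85, 2.62, 2.45, 2.30])

params, _ = curve_fit(scaling_law, compute, loss, p0=(1.5, 0.3, 1.6), maxfev=10000)
a, b, c = params

target_compute = 1e22  # the (hypothetical) full-size training run
print(f"Predicted loss at {target_compute:.0e} FLOPs: "
      f"{scaling_law(target_compute, a, b, c):.3f}")
```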
Inverse Scaling
Demystify GPT-4
Capabilities on Academic and Professional Exams: USA Biology Olympiad
Capabilities on Academic and Professional Exams: American Mathematics Competitions
Capabilities on Academic and Professional Exams
Impact of Data Contamination. Overall, across most exams, both contamination and vision have relatively little effect.
Impact of RLHF. Comparison between the GPT-4 base model and GPT-4 post-RLHF on exam benchmarks. Averaged across all exams, the base model achieves an average score of 73.7% while the RLHF model achieves 74.0%, which suggests that post-training does not substantially alter base-model capability.
Capabilities on Benchmarks
GPT-4 Multi-lingual Capability
Demystify GPT-4
GPT-4 Visual Inputs
GPT-4 Visual Inputs
GPT-4 Visual Inputs
GPT-4 Visual Inputs: How? There is no official publication about it. The Flamingo visual language model, published by DeepMind in 2022, may have inspired GPT-4's visual-input design. https://slideslive.com/38991242/flamingo-a-visual-language-model-for-fewshot-learning
Flamingo: A visual language model for few-shot learning
Flamingo Architecture Overview
Flamingo Architecture Overview
Flamingo Architecture: Gated XATTN-DENSE layers
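Below is a minimal sketch of a Flamingo-style gated XATTN-DENSE block, following the published description: tanh-gated residual connections whose gates are initialized to zero, so the frozen language model is unchanged at the start of training. Module and argument names are illustrative, not Flamingo's actual code.

```python
# Gated cross-attention + feed-forward block in the Flamingo style.
# Gates start at 0 (tanh(0) = 0), so the block is an identity at initialization.
import torch
import torch.nn as nn

class GatedXAttnDense(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffw = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffw_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # text:   (batch, text_len, d_model) -- language-model hidden states
        # visual: (batch, vis_len, d_model)  -- resampled visual tokens
        attn_out, _ = self.cross_attn(query=text, key=visual, value=visual)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ffw_gate) * self.ffw(text)
        return text
```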
Demystify GPT-4
Model Mitigations. We used a combination of dataset interventions and interventions after pre-training to mitigate harms at the model level. At the pre-training stage, we filtered our dataset mix for GPT-4 to specifically reduce the quantity of inappropriate erotic text content. We did this via a combination of internally trained classifiers and a lexicon-based approach to identify documents flagged as having a high likelihood of containing inappropriate erotic content. After the pre-training stage, our primary method for shaping GPT-4-launch behavior was RLHF. To steer our models at a more fine-grained level, we relied heavily on our models themselves as tools. One of our main tools for steering the model towards appropriate refusals is rule-based reward models (RBRMs).
Rule-based reward models. Our rule-based reward models (RBRMs) are a set of zero-shot GPT-4 classifiers. The RBRM takes three inputs: the prompt (optional), the output from the policy model, and a human-written rubric (e.g., a set of rules in multiple-choice style) for how this output should be evaluated. For example, we can provide a rubric that instructs the model to classify a response as one of:
- a refusal in the desired style
- a refusal in the undesired style (e.g., evasive)
- containing disallowed content
- a safe non-refusal response
Then, on a subset of prompts that we know request harmful content such as illicit advice, we can reward GPT-4 for refusing these requests. Conversely, we can reward GPT-4 for not refusing requests on a subset of known-safe prompts.
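The sketch below shows one way the described reward scheme could be wired up: a zero-shot classifier receives the prompt, the policy model's output, and a multiple-choice rubric, and its choice is mapped to a reward. The `classify_with_llm` function is a hypothetical stand-in for whatever zero-shot classifier (e.g., a GPT-4 call) one would actually use; the reward values are illustrative assumptions.

```python
# Hedged sketch of a rule-based reward model (RBRM) wrapper. Illustrative only.

RUBRIC = """Classify the assistant response as exactly one of:
(A) a refusal in the desired style
(B) a refusal in the undesired style (e.g., evasive)
(C) contains disallowed content
(D) a safe non-refusal response
Answer with a single letter."""

# For prompts known to request harmful content, refusing in the desired style is rewarded.
HARMFUL_PROMPT_REWARDS = {"A": 1.0, "B": 0.0, "C": -1.0, "D": -1.0}
# For known-safe prompts the mapping is reversed: answering (not refusing) is rewarded.
SAFE_PROMPT_REWARDS = {"A": -1.0, "B": -1.0, "C": -1.0, "D": 1.0}

def classify_with_llm(text: str) -> str:
    """Placeholder for a zero-shot LLM classifier (e.g., a GPT-4 call).
    In this sketch it must return one of 'A', 'B', 'C', 'D'."""
    raise NotImplementedError("plug in your own classifier here")

def rbrm_reward(prompt: str, response: str, prompt_is_harmful: bool) -> float:
    """Score one (prompt, response) pair by applying the rubric-based classifier."""
    classifier_input = (f"{RUBRIC}\n\nUser prompt:\n{prompt}\n\n"
                        f"Assistant response:\n{response}")
    label = classify_with_llm(classifier_input)
    table = HARMFUL_PROMPT_REWARDS if prompt_is_harmful else SAFE_PROMPT_REWARDS
    return table.get(label.strip().upper(), 0.0)
```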
Improvements on Safety Metrics
Examples
Demystify GPT-4
Limitations. Factuality: the capacity to generate content aligned with factual information, with related issues such as:
- Hallucinations
- Outdated information
- Domain specificity
Limitations
Limitations
Conclusion We characterize GPT-4, a large multimodal model with human-level performance on certain difficult professional and academic benchmarks. GPT-4 outperforms existing large language models on a collection of NLP tasks, and exceeds the vast majority of reported state-of-the-art systems (which often include task-specific fine-tuning). We find that improved capabilities, whilst usually measured in English, can be demonstrated in many different languages. We highlight how predictable scaling allowed us to make accurate predictions on the loss and capabilities of GPT-4. GPT-4 presents new risks due to increased capability, and we discuss some of the methods and results taken to understand and improve its safety and alignment. Though there remains much work to be done, GPT-4 represents a significant step towards broadly useful and safely deployed AI systems.