This presentation explains the training techniques for the MoE language model "Skywork-MoE".
Skywork-MoE is a 146B-parameter mixture-of-experts LLM that achieves competitive performance using only 22B active parameters through novel initialization and training techniques.
Size: 1.1 MB
Language: en
Added: Oct 21, 2024
Slides: 10
Slide Content
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Quick recap on mixture-of-experts
Hypothesis: the bigger the models, the better
Problem: bigger models are more expensive to train
Solution: sparse models, which give us the same power as bigger models with less computation; mixture-of-experts is a type of sparse model
Popular implementation in Transformer-based models: sparsify the FF layers (see the sketch below)
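To make "sparsify the FF layers" concrete, here is a minimal top-k MoE feed-forward layer in PyTorch. This is an illustrative sketch, not the Skywork-MoE or Switch Transformers implementation; the module name MoEFeedForward, the GELU expert MLPs, and top_k = 2 are assumptions. A Transformer block would use such a module in place of its dense FF sublayer, so each token only pays for top_k experts instead of one huge FF network.

```python
# Minimal sketch of a mixture-of-experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Replaces a dense FF layer with num_experts small FF networks and a learned router."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is processed by only its top_k experts.
        probs = F.softmax(self.router(x), dim=-1)              # (num_tokens, num_experts)
        weights, indices = probs.topk(self.top_k, dim=-1)      # (num_tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = indices[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```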
Preliminaries
● Architecture based on Switch Transformers
Upcycling vs. from scratch
● Learning rate scheduling is critical
● Diversification of the experts is essential for performance
● If the budget for the MoE is very small, upcycle; otherwise, train from scratch
● This is a rule of thumb; more experimentation is needed!
Training techniques: Gating Logit Normalization
● Routers tend not to discriminate between experts
● Normalizing the gating logits makes the router probabilities sharper
● λ = 1 seems to be the best value (see the sketch below)
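A minimal sketch of gating logit normalization as described on this slide: standardize the router logits per token and rescale by λ before the softmax, which yields sharper, more discriminative routing probabilities. The exact z-score formulation and the epsilon term are assumptions based on the slide text; λ = 1.0 matches the value reported as best.

```python
# Sketch of gating logit normalization (assumed z-score formulation).
import torch
import torch.nn.functional as F

def normalized_gate_probs(logits: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # logits: (num_tokens, num_experts) raw router outputs.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    z = lam * (logits - mean) / (std + 1e-6)   # per-token standardization, scaled by lambda
    return F.softmax(z, dim=-1)                # sharper expert probabilities than plain softmax
```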
Training techniques: Adaptive Auxiliary Loss Coefficients
● Layer-wise auxiliary loss
● Increase the auxiliary loss coefficient when the token-dropping rate increases, and decrease it when the rate decreases (see the sketch below)
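A minimal sketch of the adaptive rule: each MoE layer keeps its own auxiliary-loss coefficient and nudges it up when that layer's token-dropping rate rises, down when it falls. The multiplicative update, the target drop rate, and the clamping bounds are assumptions for illustration; the slide only states the increase/decrease rule.

```python
# Sketch of adaptive auxiliary-loss coefficients (update rule and constants are assumed).
def update_aux_coeff(coeff: float, drop_rate: float,
                     target_drop_rate: float = 0.01,
                     step: float = 1.1,
                     min_coeff: float = 1e-4, max_coeff: float = 1e-1) -> float:
    """Return the new auxiliary-loss coefficient for one MoE layer."""
    if drop_rate > target_drop_rate:
        coeff *= step   # load is imbalanced: push the balancing loss harder
    else:
        coeff /= step   # load is fine: relax the constraint
    return min(max(coeff, min_coeff), max_coeff)

# Example: per-layer coefficients updated from per-layer drop rates after a training step.
coeffs = [1e-2, 1e-2, 1e-2]
drop_rates = [0.03, 0.002, 0.0]
coeffs = [update_aux_coeff(c, d) for c, d in zip(coeffs, drop_rates)]
```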
Skywork-MoE
Absolute performance is good but not exceptional.
Beats Mixtral 8x22B and DBRX-Instruct, the most similar models.
Competitive with ~70B dense models while activating only 22B parameters.
Additional experiments
Discussion
● Why use token dropping?
○ Token dropping happens when an expert receives more tokens than its capacity allows (sketched below)
○ It is an artifact of the compute architecture, which required all experts to receive the same number of tokens
○ MegaBlocks / Megatron can efficiently handle a dynamic number of tokens per expert
○ Is it used as a form of regularization?
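To illustrate how token dropping arises, a minimal sketch of a fixed expert capacity: each expert can hold at most capacity_factor × num_tokens / num_experts tokens, and routed tokens beyond that are dropped (their FF update is skipped and they pass through the residual connection). The capacity factor of 1.25 and the first-come-first-served keep policy are assumptions for illustration, not the Skywork-MoE configuration.

```python
# Sketch of token dropping under a fixed per-expert capacity (illustrative only).
import torch

def kept_token_mask(expert_ids: torch.Tensor, num_experts: int,
                    capacity_factor: float = 1.25) -> torch.Tensor:
    """Return a boolean mask of tokens that fit within their assigned expert's capacity."""
    num_tokens = expert_ids.numel()
    capacity = int(capacity_factor * num_tokens / num_experts)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True   # tokens beyond the capacity are dropped
    return keep

# Skewed routing: expert 0 gets 6 of 8 tokens but its capacity is 5, so one token is dropped.
ids = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
print(kept_token_mask(ids, num_experts=2))
```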