Internal Study Session Material: Skywork-MoE

NABLAS · Oct 21, 2024

About This Presentation

This presentation explains the training techniques for the MoE language model "Skywork-MoE".
Skywork-MoE is a 146B-parameter mixture-of-experts LLM that achieves performance competitive with ~70B dense models.


Slide Content

Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Introduction
●Paper: arxiv.org/pdf/2406.06563
●GitHub: SkyworkAI/Skywork-MoE
●Skywork-MoE
○Mixture-of-experts LLM
○146B total parameters / 22B activated
○16 experts
○Competitive against ~70B dense models
●Investigates the initialization method (upcycling vs. training from scratch)
●Investigates training techniques (gating logit normalization, adaptive auxiliary loss coefficients)

Quick recap on mixture-of-experts
Hypothesis: the bigger the model, the better.
Problem: bigger models are more expensive to train.
Solution: sparse models, which give the same power as bigger models with less computation; mixture-of-experts is one type of sparse model.
Popular implementation in Transformer-based models: sparsify the feed-forward (FF) layers, as in the sketch below.
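
Below is a minimal PyTorch sketch of such a sparsified FF layer: a router picks the top-k experts for each token, and only those experts run. Module names, sizes, and the choice of top_k are illustrative assumptions, not the Skywork-MoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Sketch of an MoE feed-forward layer with top-k routing (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One dense FF network per expert; only top_k of them run for each token.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        # Router (gate): maps each token to one score per expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.router(x)                           # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)   # each token picks top_k experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = top_idx[:, slot]
            weight = top_p[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Each expert only processes the tokens routed to it.
                    out[mask] += weight[mask] * expert(x[mask])
        return out


layer = MoEFeedForward(d_model=64, d_ff=256, num_experts=16, top_k=2)
y = layer(torch.randn(8, 64))   # (8, 64): each token is processed by 2 of 16 experts
```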


Preliminaries
●Architecture based on Switch Transformers

Upcycling vs. from scratch
●Learning rate scheduling is critical
●Diversification of the experts is essential for performance
●If the training budget for the MoE is very small, upcycle from a dense checkpoint; otherwise train from scratch
●This is a rule of thumb; more experimentation is needed!

Training techniques: Gating Logit Normalization

●Routers tend not to discriminate between experts (gate outputs stay close to uniform)
●Normalizing the gating logits makes the router probabilities sharper
●λ = 1 seems to be the best value (see the sketch below)
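
A minimal sketch of the idea, assuming the gating logits are standardized (zero mean, unit variance across the expert dimension) and scaled by λ before the softmax; the exact normalization and epsilon are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def normalized_gate_probs(logits: torch.Tensor, lam: float = 1.0, eps: float = 1e-6) -> torch.Tensor:
    """logits: (num_tokens, num_experts) raw router outputs."""
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    z = lam * (logits - mean) / (std + eps)   # standardize, then scale by λ
    return F.softmax(z, dim=-1)               # sharper routing probabilities


# Example: raw logits are nearly identical, so plain softmax is almost uniform,
# while the normalized version discriminates between experts much more clearly.
logits = torch.tensor([[0.10, 0.12, 0.08, 0.11]])
print(F.softmax(logits, dim=-1))           # ~uniform distribution
print(normalized_gate_probs(logits, 1.0))  # noticeably more peaked
```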

Training techniques: Adaptive Auxiliary Loss Coefficients
●Layer-wise auxiliary (load-balancing) loss, with a separate coefficient for each MoE layer
●Increase a layer's auxiliary loss coefficient when its token dropping rate increases, and decrease it when the rate decreases (see the sketch below)
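
A minimal sketch of such an adaptive coefficient, assuming a moving-average update that pulls each layer's coefficient toward a value proportional to its observed drop rate; the constants beta and xi are illustrative, not the paper's values.

```python
def update_aux_coefficient(alpha: float, drop_rate: float,
                           beta: float = 0.99, xi: float = 1.0) -> float:
    """alpha: current aux-loss coefficient for one MoE layer.
    drop_rate: fraction of tokens dropped at that layer over the last interval."""
    target = xi * drop_rate                 # more dropping -> larger target coefficient
    return beta * alpha + (1.0 - beta) * target


# Example: each MoE layer keeps its own coefficient and updates it independently.
alphas = [0.01] * 4                         # one coefficient per MoE layer (illustrative)
observed_drop_rates = [0.00, 0.02, 0.10, 0.01]
alphas = [update_aux_coefficient(a, d) for a, d in zip(alphas, observed_drop_rates)]
print(alphas)                               # layers with heavy dropping get pushed upward
```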

Skywork-MoE
Absolute performance is good but not exceptional.
Beats Mixtral 8x22B and DBRX-Instruct, the most similar models.
Competitive with ~70B dense models despite using only 22B activated parameters.

Additional experiments

Discussion
●Why use token dropping?
○Token dropping happens when an expert receives more tokens than its capacity allows
○An artifact of the compute architecture, which required all experts to receive the same (fixed-capacity) number of tokens
○MegaBlocks / Megatron can efficiently handle a dynamic number of tokens per expert
○Possibly kept as a form of regularization? (see the capacity sketch at the end of this section)



●Multimodality was not experimented with
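
To make the token-dropping discussion concrete, here is a minimal sketch of fixed expert capacity with top-1 routing. The capacity formula follows the common Switch-Transformer-style convention and is an assumption, not the Skywork-MoE code.

```python
import torch


def drop_overflow_tokens(expert_idx: torch.Tensor, num_experts: int,
                         capacity_factor: float = 1.25) -> torch.Tensor:
    """expert_idx: (num_tokens,) chosen expert per token (top-1 routing for simplicity).
    Returns a boolean mask of tokens that are KEPT (not dropped)."""
    num_tokens = expert_idx.numel()
    # Fixed per-expert capacity; tokens routed to an expert beyond it are dropped
    # (in a full model they would pass through via the residual connection only).
    capacity = int(capacity_factor * num_tokens / num_experts)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_idx == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True   # overflow tokens stay False (dropped)
    return keep


# Example: 8 tokens, 4 experts, capacity_factor 1.0 -> capacity of 2 tokens per expert.
# Expert 0 is routed 4 tokens, so its 3rd and 4th tokens are dropped.
expert_idx = torch.tensor([0, 0, 0, 1, 1, 2, 3, 0])
print(drop_overflow_tokens(expert_idx, num_experts=4, capacity_factor=1.0))
```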