This presentation explains the training techniques for the MoE language model "Skywork-MoE".
Skywork-MoE is a 146B-parameter mixture-of-experts LLM that achieves competitive performance using only 22B active parameters through novel initialization and training techniques.
Size: 1.1 MB
Language: en
Added: Oct 21, 2024
Slides: 10
Slide Content
Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
Quick recap on mixture-of-experts
Hypothesis: the bigger the models, the better
Problem: bigger models are more expensive to train
Solution: sparse models, which give us the same power as bigger models with less computation; mixture-of-experts is a type of sparse model
Popular implementation in Transformer-based models: sparsify the FF layers (see the sketch below)
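To make "sparsify the FF layers" concrete, here is a minimal top-k MoE feed-forward layer in PyTorch. This is an illustrative sketch, not the Skywork-MoE or Switch Transformers implementation; the module name MoEFeedForward, the GELU expert MLPs, and top_k = 2 are assumptions. A Transformer block would use such a module in place of its dense FF sublayer, so each token only pays for top_k experts instead of one huge FF network.

```python
# Minimal sketch of a mixture-of-experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Replaces a dense FF layer with num_experts small FF networks and a learned router."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); each token is processed by only its top_k experts.
        probs = F.softmax(self.router(x), dim=-1)              # (num_tokens, num_experts)
        weights, indices = probs.topk(self.top_k, dim=-1)      # (num_tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = indices[:, k] == e                      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```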
Preliminaries
● Architecture based on Switch Transformers
Upcycling vs. from scratch
● Learning rate scheduling is critical
● Diversification of the experts is essential for performance
● If the budget for the MoE is very small, upcycle; otherwise, train from scratch
● This is a rule of thumb; more experimentation is needed!
Training techniques: Gating Logit Normalization
● Routers tend not to discriminate between experts
● Normalizing the gating logits makes the router probabilities sharper
● λ = 1 seems to be the best value (see the sketch below)
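A minimal sketch of gating logit normalization as described on this slide: standardize the router logits per token and rescale by λ before the softmax, which yields sharper, more discriminative routing probabilities. The exact z-score formulation and the epsilon term are assumptions based on the slide text; λ = 1.0 matches the value reported as best.

```python
# Sketch of gating logit normalization (assumed z-score formulation).
import torch
import torch.nn.functional as F

def normalized_gate_probs(logits: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    # logits: (num_tokens, num_experts) raw router outputs.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    z = lam * (logits - mean) / (std + 1e-6)   # per-token standardization, scaled by lambda
    return F.softmax(z, dim=-1)                # sharper expert probabilities than plain softmax
```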
Training techniques: Adaptive Auxiliary Loss Coefficients
● Layer-wise auxiliary loss
● Increase the auxiliary loss coefficient when the token-dropping rate increases, and decrease it when the rate decreases (see the sketch below)
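A minimal sketch of the adaptive rule: each MoE layer keeps its own auxiliary-loss coefficient and nudges it up when that layer's token-dropping rate rises, down when it falls. The multiplicative update, the target drop rate, and the clamping bounds are assumptions for illustration; the slide only states the increase/decrease rule.

```python
# Sketch of adaptive auxiliary-loss coefficients (update rule and constants are assumed).
def update_aux_coeff(coeff: float, drop_rate: float,
                     target_drop_rate: float = 0.01,
                     step: float = 1.1,
                     min_coeff: float = 1e-4, max_coeff: float = 1e-1) -> float:
    """Return the new auxiliary-loss coefficient for one MoE layer."""
    if drop_rate > target_drop_rate:
        coeff *= step   # load is imbalanced: push the balancing loss harder
    else:
        coeff /= step   # load is fine: relax the constraint
    return min(max(coeff, min_coeff), max_coeff)

# Example: per-layer coefficients updated from per-layer drop rates after a training step.
coeffs = [1e-2, 1e-2, 1e-2]
drop_rates = [0.03, 0.002, 0.0]
coeffs = [update_aux_coeff(c, d) for c, d in zip(coeffs, drop_rates)]
```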
Skywork-MoE
Absolute performance is good but not exceptional.
Beats Mixtral 8x22B and DBRX-Instruct, the most similar models.
Competitive with ~70B dense models while activating only 22B parameters.
Additional experiments
Discussion
● Why use token dropping?
○ Token dropping happens when an expert receives more tokens than its capacity allows (sketched below)
○ It is an artifact of the compute architecture, which required all experts to receive the same number of tokens
○ MegaBlocks / Megatron can efficiently handle a dynamic number of tokens per expert
○ Is it used as a form of regularization?
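To illustrate how token dropping arises, a minimal sketch of a fixed expert capacity: each expert can hold at most capacity_factor × num_tokens / num_experts tokens, and routed tokens beyond that are dropped (their FF update is skipped and they pass through the residual connection). The capacity factor of 1.25 and the first-come-first-served keep policy are assumptions for illustration, not the Skywork-MoE configuration.

```python
# Sketch of token dropping under a fixed per-expert capacity (illustrative only).
import torch

def kept_token_mask(expert_ids: torch.Tensor, num_experts: int,
                    capacity_factor: float = 1.25) -> torch.Tensor:
    """Return a boolean mask of tokens that fit within their assigned expert's capacity."""
    num_tokens = expert_ids.numel()
    capacity = int(capacity_factor * num_tokens / num_experts)
    keep = torch.zeros(num_tokens, dtype=torch.bool)
    for e in range(num_experts):
        positions = (expert_ids == e).nonzero(as_tuple=True)[0]
        keep[positions[:capacity]] = True   # tokens beyond the capacity are dropped
    return keep

# Skewed routing: expert 0 gets 6 of 8 tokens but its capacity is 5, so one token is dropped.
ids = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
print(kept_token_mask(ids, num_experts=2))
```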