2024-02, Efficient Multilingual Language Model Compression through Vocabulary Trimming

About This Presentation

https://aclanthology.org/2023.findings-emnlp.981/

Slide Content

Efficient Multilingual Language Model Compression through Vocabulary Trimming

Asahi Ushio
9th Feb 2024
Paper link
https://github.com/asahi417/lm-vocab-trimmer

Language Model (LM)
Slide credit: Stanford AI

Decoder, Encoder…?
Decoder LM
● a.k.a. Autoregressive LM
● a.k.a. Unidirectional LM
● a.k.a. Causal LM
● e.g. GPT, PaLM, Llama
● Generation (dialogue, completion)
Encoder LM
● a.k.a. Masked LM
● a.k.a. Bidirectional LM
● e.g. BERT, RoBERTa
● Classification (sentiment, NER, search)
Encoder-Decoder LM
● a.k.a. Seq2seq LM
● a.k.a. Prefix LM
● e.g. T5, BART, UL2
● Reasoning (QA, QG, translation, summarization)

Multilingual LM
Monolingual LM (a separate LM for each of en, ja, de)
● Pretraining an LM for each language is expensive.
● Reliable LMs are lacking for many languages.

Multilingual LM (one LM shared by en, ja, de)
● Single LM for 100 languages.
● Many established LMs (mT5, XLM-R, etc.).

[Figure: examples of monolingual downstream tasks: Classification (English), NER (Japanese), QA (German), NER (English), Classification (German).]

Multilingual LMs are Bulky
Multilingual LMs have larger vocabularies.
● T5 Small (90M) vs mT5 Small (300M)
● BART Large (140M) vs mBART Large (600M)
● RoBERTa Base (140M) vs XLM-R Base (270M)
Same architecture (number of layers, hidden dimension, etc.).

Embedding Matrix
Model (total params): Embedding Matrix / Other Weights
● XLM-R BASE (278M): 192M (70%) / 86M (30%)
● mT5 SMALL (300M): 256M (86%) / 44M (14%)
● mBART LARGE (611M): 256M (41%) / 354M (59%)
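
The embedding sizes in this breakdown can be roughly reproduced as vocabulary size times hidden dimension. The sketch below is only an illustration: the vocabulary sizes, hidden dimensions, and the assumption that mT5 keeps separate (untied) input and output embedding matrices are not stated on the slides and are taken here as assumptions about the public checkpoints. The model totals are the ones shown on the slide.

```python
# Assumed (not from the slides): vocabulary sizes and hidden dimensions
# as commonly reported for the public checkpoints.
MODELS = {
    # name: (vocab_size, hidden_dim, embedding matrices, total params from the slide)
    "XLM-R base": (250_002, 768, 1, 278e6),
    "mT5 small": (250_112, 512, 2, 300e6),  # input/output embeddings assumed untied
    "mBART large": (250_027, 1024, 1, 611e6),
}


def embedding_params(vocab_size, hidden_dim, n_matrices=1):
    """Number of parameters held in the token embedding matrices."""
    return vocab_size * hidden_dim * n_matrices


for name, (vocab, dim, n, total) in MODELS.items():
    emb = embedding_params(vocab, dim, n)
    # Share of the whole model that sits in the embedding matrices.
    print(f"{name}: {emb / 1e6:.0f}M embedding params, {emb / total:.0%} of the model")
```

Under these assumptions the embedding counts come out close to the 192M, 256M, and 256M shown above; the percentages may differ from the slide by a point because of rounding.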

Research Question
We finetune multilingual LMs on monolingual tasks, so most tokens of the other languages are never used.

Can we drop those unused tokens at inference time to reduce the model size?

Vocabulary Trimming
[Figure: the same parameter breakdown for XLM-R BASE (278M), mT5 SMALL (300M), and mBART LARGE (611M), with each embedding matrix cut down by VT ✂. Trimmed embedding sizes shown: 61M, 46M, 61M.]

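What’s VT?
[Figure: a multilingual LM has an embedding matrix covering all languages (Korean, English, …, French) plus other weights. A French-trimmed LM keeps only the French part of the embedding matrix, and a Korean-trimmed LM keeps only the Korean part; the other weights are kept as they are.]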
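
The core of VT can be sketched in a few lines: pick the token ids that occur in a target-language corpus, keep only those rows of the embedding matrix, and remap the ids. This is a minimal illustration in plain Python, not the implementation in lm-vocab-trimmer; the function names `select_vocab` and `trim_embedding` and the toy data are hypothetical.

```python
from collections import Counter


def select_vocab(corpus_token_ids, special_ids, top_n=None):
    """Return the sorted token ids to keep for one target language.

    corpus_token_ids: tokenized target-language corpus (list of id lists).
    special_ids: ids that are always kept (padding, end-of-sequence, ...).
    top_n: if given, keep only the top_n most frequent corpus tokens
           (special ids are added on top of that).
    """
    counts = Counter(tid for sentence in corpus_token_ids for tid in sentence)
    frequent = [tid for tid, _ in counts.most_common(top_n)]
    return sorted(set(special_ids) | set(frequent))


def trim_embedding(embedding, keep_ids):
    """Keep only the rows in keep_ids and build an old-id -> new-id map."""
    old_to_new = {old: new for new, old in enumerate(keep_ids)}
    trimmed = [embedding[old] for old in keep_ids]
    return trimmed, old_to_new


# Toy example: 10-token vocabulary with 2-dimensional embeddings.
embedding = [[float(i), float(i) + 0.5] for i in range(10)]
corpus = [[3, 7, 7], [3, 9, 7]]  # hypothetical tokenized corpus
keep = select_vocab(corpus, special_ids=[0, 1])
trimmed, old_to_new = trim_embedding(embedding, keep)
print(len(embedding), "->", len(trimmed), "rows")
```

Only the embedding rows change; every other weight of the model is untouched, which is why the vector of a kept token is identical before and after trimming. A real model would also need its tokenizer and, where present, its output projection remapped with the same id map.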

Two variations of VT
● Pre-FT VT: Multilingual LM → VT(French) → FT(French)
● Post-FT VT: Multilingual LM → FT(French) → VT(French)

Question Answering (QA) and Question Generation (QG)
● Model: mT5 small
● Ans-F1, METEOR: higher is better.
● No-Trim: 250K tokens (300M params).
● Trim: 90K (136M), 60K (105M), 30K (74M), 15K (59M).
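
The model sizes quoted for each trimmed vocabulary follow from simple arithmetic: non-embedding weights plus vocabulary size times hidden dimension for each embedding matrix. The sketch below assumes a hidden dimension of 512, two untied embedding matrices (input and output), and a full vocabulary of 250,112 tokens for mT5 small; these values are assumptions and are not given on the slides. The 44M of other weights is the figure shown earlier in the deck.

```python
HIDDEN_DIM = 512          # assumed embedding width of mT5 small
OTHER_WEIGHTS = 44e6      # non-embedding parameters, as shown in the slides
N_EMBEDDING_MATRICES = 2  # assumed: untied input and output embeddings


def trimmed_model_size(vocab_size):
    """Total parameter count after trimming the vocabulary to vocab_size."""
    return OTHER_WEIGHTS + N_EMBEDDING_MATRICES * vocab_size * HIDDEN_DIM


# 250_112 is the assumed full vocabulary; the others are the trim sizes above.
for vocab in (250_112, 90_000, 60_000, 30_000, 15_000):
    print(f"{vocab:>7} tokens -> {trimmed_model_size(vocab) / 1e6:.1f}M params")
```

Under these assumptions each result lands within about 1M parameters of the sizes listed for No-Trim and the four trimmed vocabularies.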
