Internal Study Session Material: XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model

NABLAS, Jun 28, 2024

About This Presentation

We have published the slides from our internal study session, "XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model"!

• Adopts a neural-codec-based speech representation
• GPT-2-based decoder and Perceiver-style speaker encoder
• Particularly strong performance in English…


Slide Content

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Casanova et al., INTERSPEECH 2024
Paper Discussion, 28 June 2024
Presenter: Nabarun Goswami, NABLAS

Background
• Until recently (and in some models even now), the speech representation for TTS was the (mel-)spectrogram.
• More recent models instead use speech representations from neural codecs (EnCodec, SoundStream, etc.).
[Figure: SoundStream neural codec architecture]

Advantages of Codec-Based Modeling over Spectrograms
• Spectrograms are continuous, so typical loss functions are Mean Squared Error (MSE/L2) or Mean Absolute Error (MAE/L1).
• MSE maximizes the likelihood of the observed data under a Gaussian error model; MAE does so under a Laplacian error model.
• With discrete tokens, however, a classification approach is used: predict the token label from a fixed vocabulary.
• The cross-entropy loss measures the divergence between the true and predicted distributions without explicit assumptions about the underlying error distribution.
• This makes it more flexible for real-world data such as the discrete speech tokens from neural codec models (see the sketch below).
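
To make the contrast concrete, here is a minimal PyTorch sketch of the two training regimes. The tensor shapes (2 utterances, 80 mel bins, 100 frames) are illustrative assumptions; only the 8192-entry codebook size comes from the paper.

import torch
import torch.nn.functional as F

# Continuous spectrogram target: regression losses.
pred_mel = torch.randn(2, 80, 100)    # hypothetical predicted mel
true_mel = torch.randn(2, 80, 100)    # hypothetical ground-truth mel
mse = F.mse_loss(pred_mel, true_mel)  # Gaussian error model (L2)
mae = F.l1_loss(pred_mel, true_mel)   # Laplacian error model (L1)

# Discrete codec tokens: classification over a fixed vocabulary
# (8192, matching the paper's VQ-VAE codebook size).
logits = torch.randn(2, 100, 8192)             # per-frame logits
true_codes = torch.randint(0, 8192, (2, 100))  # ground-truth codes
ce = F.cross_entropy(logits.transpose(1, 2), true_codes)

print(f"MSE={mse:.3f}  MAE={mae:.3f}  CE={ce:.3f}")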

XTTS
(Perceiver: Jaegle et al., ICML 2021)
1. Train an 8192-token mel-spectrogram VQ-VAE (the neural codec).
2. Use a 6681-token BPE text tokenizer.
3. Train GPT-2 with LM heads predicting the audio codes from Step 1.
4. Use a Perceiver architecture for speaker conditioning.
5. Train the decoder/vocoder on the GPT-2 latents before the LM heads, conditioned on a pre-trained speaker encoder (see the sketch below).
6. Loss functions:
   a) GPT-2: cross-entropy
   b) Decoder:
      i. Reconstruction (L1/L2)
      ii. Adversarial
      iii. Speaker consistency
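
Below is a minimal, illustrative PyTorch sketch of steps 3–5. Every module is a stand-in assumption: a single cross-attention layer replaces the Perceiver, a small transformer replaces GPT-2, and all sizes except the two vocabulary sizes are made up; this is not the authors' implementation.

import torch
import torch.nn as nn

class XTTSSketch(nn.Module):
    def __init__(self, text_vocab=6681, audio_vocab=8192, d=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d)
        self.audio_emb = nn.Embedding(audio_vocab, d)
        # Perceiver-style speaker conditioning: a fixed set of latent
        # queries cross-attends over reference-audio features, so any
        # number/length of reference clips maps to 32 latents.
        self.spk_latents = nn.Parameter(torch.randn(32, d))
        self.spk_xattn = nn.MultiheadAttention(d, 8, batch_first=True)
        layer = nn.TransformerEncoderLayer(d, 8, batch_first=True)
        self.gpt = nn.TransformerEncoder(layer, num_layers=4)  # GPT-2 stand-in
        self.lm_head = nn.Linear(d, audio_vocab)

    def forward(self, text_ids, audio_ids, ref_feats):
        b = text_ids.size(0)
        q = self.spk_latents.unsqueeze(0).expand(b, -1, -1)
        spk, _ = self.spk_xattn(q, ref_feats, ref_feats)  # (b, 32, d)
        # Prefix layout: [speaker latents | text tokens | audio codes].
        x = torch.cat([spk, self.text_emb(text_ids),
                       self.audio_emb(audio_ids)], dim=1)
        h = self.gpt(x)  # causal masking omitted for brevity
        # The LM head predicts the next audio code (trained with
        # cross-entropy); the HiFi-GAN-style decoder consumes the
        # pre-head latents h, not the discrete codes.
        return self.lm_head(h[:, -audio_ids.size(1):])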

Dataset
• Sources:
  • English: LibriTTS-R, LibriLight, internal dataset
  • Other languages: Common Voice

Results
Demo: https://huggingface.co/spaces/coqui/xtts

Discussion
• Good:
  • Speech quality is quite good.
  • The Perceiver speaker encoder allows multiple reference audios without a length limitation.
  • The HiFi-GAN-based vocoder operating on GPT-2 latents reduces some inference latency.
• Not so good:
  • Japanese, Korean, and Chinese are romanized before tokenization.
  • CER for these languages is quite high compared to the other languages.
  • GPT-2 is a decoder-only transformer:
    • Potential for hallucinations
    • Slower inference, generating one token/frame at a time (see the sketch below)
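
A hypothetical greedy decoding loop, reusing the XTTSSketch model from above, illustrates why autoregressive inference is slow: each new audio code costs another forward pass. The bos_id/eos_id special codes and the absence of KV caching are simplifying assumptions, not details from the paper.

import torch

@torch.no_grad()
def generate_codes(model, text_ids, ref_feats, max_frames=600,
                   bos_id=8190, eos_id=8191):
    # bos_id/eos_id are hypothetical special codes, not from the paper.
    audio_ids = torch.full((text_ids.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_frames):
        logits = model(text_ids, audio_ids, ref_feats)  # re-run every step
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy pick
        audio_ids = torch.cat([audio_ids, next_id], dim=1)
        if (next_id == eos_id).all():  # stop once every sequence hits EOS
            break
    return audio_ids[:, 1:]  # drop the BOS token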