XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Casanova et al., INTERSPEECH 2024
Paper Discussion, 28 June 2024
Presenter: Nabarun Goswami, NABLAS
Background
•Previously (and in some models still today), the speech representation used by TTS models was the (mel-)spectrogram.
•More recently, speech representations come from neural codecs (EnCodec, SoundStream, etc.).
[Figure: SoundStream architecture]
Advantages of Codec-based Modeling over Spectrograms
•Spectrograms are continuous, so the typical loss functions are Mean Squared Error (MSE/L2) or Mean Absolute Error (MAE/L1).
•Minimizing MSE maximizes the likelihood of the observed data under a Gaussian error model, while minimizing MAE does so under a Laplacian error model.
•When dealing with discrete tokens, however, a classification approach is used, i.e. predict the token label from a fixed vocabulary.
•Cross-entropy loss is used, which measures the divergence between the true and predicted distributions without making explicit assumptions about the underlying error distribution.
•This makes it more flexible for real-world data such as discrete speech tokens from neural codec models (see the sketch below).
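A minimal PyTorch sketch of the contrast; the tensor shapes are illustrative, and the 8192-entry vocabulary matches the XTTS codec described in the next section:

```python
import torch
import torch.nn.functional as F

# Continuous representation: a batch of mel-spectrograms.
# MSE corresponds to a Gaussian error model, L1/MAE to a Laplacian one.
pred_mel = torch.randn(4, 80, 100)          # (batch, mel bins, frames)
true_mel = torch.randn(4, 80, 100)
mse = F.mse_loss(pred_mel, true_mel)        # Gaussian assumption
mae = F.l1_loss(pred_mel, true_mel)         # Laplacian assumption

# Discrete representation: codec token IDs from a fixed vocabulary.
# Cross-entropy compares the predicted distribution over the vocabulary
# against the true token, with no parametric assumption on the error.
vocab_size = 8192
logits = torch.randn(4, 100, vocab_size)              # (batch, frames, vocab)
codes = torch.randint(0, vocab_size, (4, 100))        # ground-truth codes
ce = F.cross_entropy(logits.transpose(1, 2), codes)   # expects (N, C, ...)
```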
XTTS
(Perceiver: Jaegle et al., ICML 2021)
1.Train an 8192-token mel-spectrogram VQ-VAE (the neural codec).
2.Use a 6681-token BPE text tokenizer.
3.Train GPT-2 with LM heads predicting the audio codes from Step 1 (sketched below).
4.Use the Perceiver architecture for speaker conditioning (sketched below).
5.Train a decoder/vocoder on the GPT-2 latents before the LM heads, conditioned on a pre-trained speaker encoder.
6.Loss functions:
a)GPT-2: cross-entropy
b)Decoder (sketched below):
i.Reconstruction (L1/L2),
ii.Adversarial,
iii.Speaker consistency
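A minimal sketch of steps 3–4, assuming hypothetical module names (`PerceiverResampler`, `XTTSLikeGPT`) and a small generic transformer in place of GPT-2; the dimensions and layer counts are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Cross-attends a fixed set of learned latent queries to reference-audio
    features, so any number/length of references yields a fixed-size speaker
    conditioning (simplified stand-in for the XTTS conditioning module)."""
    def __init__(self, dim=512, n_latents=32, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, ref_feats):  # ref_feats: (batch, ref frames, dim)
        q = self.latents.unsqueeze(0).expand(ref_feats.size(0), -1, -1)
        out, _ = self.attn(q, ref_feats, ref_feats)
        return out                 # (batch, n_latents, dim)

class XTTSLikeGPT(nn.Module):
    """Decoder-only transformer over [speaker latents | text | audio codes];
    an LM head predicts the next audio code (toy stand-in for GPT-2)."""
    def __init__(self, text_vocab=6681, audio_vocab=8192, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.audio_emb = nn.Embedding(audio_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.lm_head = nn.Linear(dim, audio_vocab)

    def forward(self, spk_latents, text_ids, audio_ids):
        x = torch.cat([spk_latents,
                       self.text_emb(text_ids),
                       self.audio_emb(audio_ids)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        latents = h[:, -audio_ids.size(1):]   # positions over the audio codes
        # Training minimizes cross-entropy of these logits against the codes
        # shifted by one; `latents` (not the codes) later feed the decoder.
        return self.lm_head(latents), latents

model, resampler = XTTSLikeGPT(), PerceiverResampler()
spk = resampler(torch.randn(2, 120, 512))     # features of reference audio
text = torch.randint(0, 6681, (2, 40))
codes = torch.randint(0, 8192, (2, 200))
logits, latents = model(spk, text, codes)     # (2, 200, 8192), (2, 200, 512)
```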
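And a sketch of the combined decoder objective in 6(b); the weights, the least-squares adversarial term, and the cosine-based speaker-consistency term are assumptions for illustration, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def decoder_loss(wav_fake, wav_real, disc_fake_logits,
                 spk_emb_fake, spk_emb_real, w_adv=1.0, w_spk=1.0):
    # Reconstruction: L1 between generated and reference signals
    # (an L2 or spectral term could be used instead).
    rec = F.l1_loss(wav_fake, wav_real)
    # Adversarial: LSGAN-style generator loss against a discriminator.
    adv = F.mse_loss(disc_fake_logits, torch.ones_like(disc_fake_logits))
    # Speaker consistency: embeddings of real and generated speech from a
    # frozen, pre-trained speaker encoder should match.
    spk = 1 - F.cosine_similarity(spk_emb_fake, spk_emb_real, dim=-1).mean()
    return rec + w_adv * adv + w_spk * spk
```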
Discussion
•Good:
  •Speech quality is quite good
  •The Perceiver allows multiple reference audios without a length limitation
  •The HiFi-GAN-based vocoder on GPT-2 latents reduces some inference latency
•Not so good:
  •Japanese, Korean, and Chinese are romanized before tokenization
    •CER for these languages is quite high compared to other languages
  •GPT-2 is a decoder-only transformer
    •Potential for hallucinations
    •Slower inference, one token/frame at a time (see the decoding loop sketched below)
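To make the latency point concrete, a greedy decoding loop over the hypothetical `XTTSLikeGPT` sketched earlier; every audio frame costs one full forward pass:

```python
import torch

@torch.no_grad()
def generate_codes(model, spk_latents, text_ids, bos_id=0, max_frames=600):
    batch = text_ids.size(0)
    audio = torch.full((batch, 1), bos_id, dtype=torch.long)
    for _ in range(max_frames):              # one forward pass per frame
        logits, _ = model(spk_latents, text_ids, audio)
        next_code = logits[:, -1].argmax(dim=-1, keepdim=True)
        audio = torch.cat([audio, next_code], dim=1)
    return audio[:, 1:]                      # drop the BOS placeholder
```

In practice a KV cache and a stop token would speed this up, but the frame-by-frame sequential dependency remains.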