Voice Conversion
Dr. Elizabeth Godoy
Speech Processing Guest Lecture, December 11, 2012
E. Godoy Biography
• United States (Rhode Island): native
  • Hometown: Middletown, RI
  • Undergrad & Masters at MIT (Boston area)
• France (Lannion): 2007-2011
  • Worked on my PhD at Orange Labs
• Iraklio: current
  • Work at FORTH with Prof. Stylianou
Professional Background
• B.S. & M.Eng. from MIT, Electrical Engineering
  • Specialty in signal processing
  • Underwater acoustics: target physics, environmental modeling, torpedo homing
  • Antenna beamforming (Masters): wireless networks
• PhD in Signal Processing at Orange Labs
  • Speech processing: voice conversion
  • Speech Synthesis Team (Text-to-Speech)
  • Focus on spectral envelope transformation
• Post-doctoral research at FORTH
  • LISTA: speech in noise & intelligibility
  • Analyses of human speaking styles (e.g. Lombard, Clear)
  • Speech modifications to improve intelligibility
Today’s Lecture: Voice Conversion
• Introduction to Voice Conversion
  • Speech Synthesis Context (TTS)
  • Overview of Voice Conversion
• Spectral Envelope Transformation in VC
  • Standard: Gaussian Mixture Model
  • Proposed: Dynamic Frequency Warping + Amplitude Scaling
• Conversion Results
  • Objective Metrics & Subjective Evaluations
  • Sound Samples
• Summary & Conclusions
Voice Conversion (VC)
• Transform the speech of a (source) speaker so that it sounds like the speech of a different (target) speaker.
[Figure: cartoon of a source speaker's utterance ("This is awesome!") converted so that the target speaker reacts "He sounds like me!"]
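In signal-processing terms, this transformation is usually realized as an analysis/modification/synthesis loop: decompose the source speech into vocoder parameters, map the speaker-dependent parameters (F0 and spectral envelope) toward the target, and resynthesize. Below is a minimal sketch of that loop, assuming the WORLD vocoder via the `pyworld` package and `soundfile` for audio I/O; the `map_envelope` function and the F0 statistics are placeholders standing in for a trained conversion model, not the method presented in this lecture.

```python
import numpy as np
import pyworld          # WORLD vocoder bindings (assumed installed)
import soundfile as sf  # audio I/O (assumed installed)

def convert_voice(source_wav, out_wav, map_envelope, f0_stats):
    """Analysis -> transformation -> synthesis sketch of voice conversion.

    map_envelope: placeholder frame-wise spectral envelope mapping
                  (e.g. a trained GMM or frequency-warping function).
    f0_stats:     (src_mean, src_std, tgt_mean, tgt_std) of log-F0.
    """
    x, fs = sf.read(source_wav)                 # assumes a mono recording
    x = np.ascontiguousarray(x, dtype=np.float64)

    # 1) Analysis: F0 contour, spectral envelope, aperiodicity.
    f0, sp, ap = pyworld.wav2world(x, fs)

    # 2) Transformation of speaker-dependent parameters.
    #    (a) Simple mean/variance log-F0 conversion (a common baseline).
    src_mu, src_std, tgt_mu, tgt_std = f0_stats
    voiced = f0 > 0
    f0[voiced] = np.exp(tgt_mu + (np.log(f0[voiced]) - src_mu) * tgt_std / src_std)
    #    (b) Frame-wise spectral envelope mapping toward the target speaker.
    sp = np.vstack([map_envelope(frame) for frame in sp])

    # 3) Synthesis: rebuild the waveform from the converted parameters.
    y = pyworld.synthesize(f0, np.ascontiguousarray(sp), ap, fs)
    sf.write(out_wav, y, fs)
```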
Context: Speech Synthesis
• Increase in applications using speech technologies
  • Cell phones, GPS, video gaming, customer service apps…
  • Information communicated through speech!
• Text-to-Speech (TTS) Synthesis
  • Generate speech from a given text
[Figure: cartoon examples of synthesized prompts, e.g. "Insert your card.", "Turn left!", "This is Abraham Lincoln speaking…", "Next stop: Lannion"]
Text-to-Speech (TTS) Systems
• TTS Approaches
  • Concatenative: speech synthesized from recorded segments
    • Unit selection: segments of speech chosen from corpora & strung together (see the toy selection sketch after this slide)
    • High-quality synthesis, but need to record & process corpora
  • Parametric: speech generated from model parameters
    • HMM-based: speaker models built from speech using linguistic info
    • Limited quality due to simplified speech modeling & statistical averaging
• Concatenative or parametric?
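To make the unit-selection idea concrete, here is a toy dynamic-programming sketch (an illustration, not code from the lecture): each candidate unit is scored by a target cost and a join (concatenation) cost with its predecessor, and the cheapest path through the candidate lattice is selected. The candidate inventory and both cost functions are assumed to be supplied by the caller.

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """candidates[i] = list of candidate units for target position i.
    Returns the unit sequence minimizing total target + join cost (Viterbi search)."""
    n = len(candidates)
    # best[i][j] = lowest total cost of any sequence ending in candidates[i][j]
    best = [np.array([target_cost(0, u) for u in candidates[0]])]
    back = []
    for i in range(1, n):
        tc = np.array([target_cost(i, u) for u in candidates[i]])
        jc = np.array([[join_cost(prev, u) for u in candidates[i]]
                       for prev in candidates[i - 1]])
        total = best[-1][:, None] + jc + tc[None, :]
        back.append(total.argmin(axis=0))   # best predecessor for each candidate
        best.append(total.min(axis=0))
    # Trace back the cheapest path through the lattice.
    idx = [int(best[-1].argmin())]
    for bp in reversed(back):
        idx.append(int(bp[idx[-1]]))
    idx.reverse()
    return [candidates[i][j] for i, j in enumerate(idx)]
```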
Text-to-Speech (TTS) Example
[Audio: example synthesized voice]
Voice Conversion: TTS Motivation
• Concatenative speech synthesis
  • High-quality speech
  • But, need to record & process a large corpus for each voice
• Voice Conversion
  • Create different voices by speech-to-speech transformation
  • Focus on acoustics
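The lecture outline above lists GMM-based spectral envelope transformation as the standard approach. For reference, here is a minimal sketch of the classic joint-GMM mapping (the conditional expectation E[y|x], in the spirit of Stylianou-style conversion), assuming scikit-learn/SciPy and source/target spectral feature frames that have already been time-aligned; it is an illustrative baseline, not the lecture's own implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_frames, tgt_frames, n_components=8):
    """Fit a GMM on joint [source; target] feature vectors (frames aligned, e.g. by DTW)."""
    z = np.hstack([src_frames, tgt_frames])           # shape (T, 2d)
    return GaussianMixture(n_components=n_components, covariance_type='full').fit(z)

def convert_frame(x, gmm, d):
    """MMSE conversion of one source frame x (length d): E[y | x] under the joint GMM."""
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d:, :d]
    # Posterior component probabilities p(k | x) from the marginal model over x.
    lik = np.array([multivariate_normal.pdf(x, mu_x[k], S_xx[k])
                    for k in range(gmm.n_components)]) * gmm.weights_
    post = lik / lik.sum()
    # E[y | x] = sum_k p(k|x) * ( mu_y_k + S_yx_k S_xx_k^{-1} (x - mu_x_k) )
    y = np.zeros(d)
    for k in range(gmm.n_components):
        y += post[k] * (mu_y[k] + S_yx[k] @ np.linalg.solve(S_xx[k], x - mu_x[k]))
    return y
```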
Reclamation-based Voice Conversion (RVC)
Presented by: GURUMURTHY.V (421122102050), DHIVYAKUMARAN.K (421122102037), GOKULRAJ.R (421122102044)
Under the guidance of MRS. K. KALAISELVI, AP/CSE
PROBLEM STATEMENT
• Naturalness: maintaining the naturalness of the source speaker's voice while converting it to the target speaker is challenging; artifacts and distortions can occur.
• Data quality: noise, background interference, and inconsistent recording conditions can negatively impact the performance of the model.
• Accurate retrieval: the quality of the converted speech depends on the accuracy of the retrieved target speaker utterances; ineffective retrieval can result in mismatched voice characteristics.
PROBLEM STATEMENT DIAGRAM
• The quality of the data significantly impacts the model's performance: noise, background interference, and inconsistent recording conditions can degrade results.
• Timbre conversion and imitation are not yet perfect.
• Voice conversion models require a vast amount of data to learn intricate voice characteristics; collecting and processing such a large dataset can be time-consuming and expensive.
PROPOSED SYSTEM
• A powerful speech encoder is trained on a large dataset to extract high-level acoustic features that capture speaker identity and linguistic content.
• Given a target speaker's voice, a set of representative speech segments is extracted and encoded into the same feature space.
• For a given input speech utterance, the encoder generates its corresponding feature representation (see the retrieval sketch below).
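A minimal sketch of the retrieval step described above, assuming a generic frame-level speech encoder that produces an (n_frames, dim) feature matrix; the encoder itself, the helper names, and the blend ratio are illustrative assumptions, not part of the proposed system's actual code. Target-speaker segments are indexed once, and each input frame is matched to its most similar target frame by cosine similarity and blended back with the original feature to preserve linguistic content.

```python
import numpy as np

def build_index(target_features):
    """Stack encoded target-speaker segments into one searchable feature bank."""
    bank = np.vstack(target_features)                      # (N, dim)
    norms = np.linalg.norm(bank, axis=1, keepdims=True)
    return bank, bank / np.maximum(norms, 1e-8)            # raw + unit-norm copies

def retrieve(input_features, bank, bank_unit, mix=0.75):
    """For each input frame, find the nearest target frame (cosine similarity)
    and blend it with the original feature."""
    q = input_features / np.maximum(
        np.linalg.norm(input_features, axis=1, keepdims=True), 1e-8)
    nearest = bank[(q @ bank_unit.T).argmax(axis=1)]       # (T, dim)
    return mix * nearest + (1.0 - mix) * input_features
```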
EXISTING SYSTEM
1. "Voice Conversion from a Single Speaker Example", Jun Wu, Yi Li, and Yiheng Liu, 2015. Drawback: requires a large amount of training data for each speaker.
2. "A Waveform Generation Model for High-Quality Speech Synthesis", Tomoki Kaneko, Kazuhiro Takahashi, and Shinji Nakamura, 2018. Drawback: can produce artifacts in the synthesized speech, especially for unseen speakers.
3. "A Generative Adversarial Network Approach to Text-to-Speech Synthesis", Zheng Wang, Jonathan Sotelo, Haoyu Wu, and Gregor Kurz, 2017. Drawback: the generated speech quality can be sensitive to the quality of the text-to-speech (TTS) model used.
4. "A Generative Model for Raw Audio", Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu, 2016. Drawback: requires a significant amount of computational resources for training and inference.
5. "A Flow-based Generative Model for Text-to-Speech Synthesis", Jongwook Kim, Hyunjun Kim, and Ho-Sang Lee, 2020. Drawback: can struggle with preserving the naturalness and expressiveness of the original voice.