Harnessing Wild and Untamed (Publicly Available) Data for the Cost-Efficient Development of Customized Speech-Related Services
weiwchu
About This Presentation
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, it automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling cost.
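The abstract describes inferring word-level transcriptions with confidence scores from web audio. As a rough illustration of that idea only (not Olewave's actual system), here is a minimal sketch of confidence-filtered pseudo-labeling using the open-source openai-whisper package; the function name `pseudo_label` and the 0.6 confidence threshold are hypothetical choices.

```python
# Minimal sketch of confidence-scored pseudo-labeling with word-level timestamps,
# in the spirit of the labeling system described above (NOT Olewave's actual code).
# Assumes the open-source `openai-whisper` package: pip install openai-whisper
import whisper

def pseudo_label(audio_path: str, min_word_prob: float = 0.6) -> dict:
    """Transcribe one audio file and keep only confidently recognized words."""
    model = whisper.load_model("large-v3")
    result = model.transcribe(audio_path, word_timestamps=True)

    words = []
    for segment in result["segments"]:
        for w in segment.get("words", []):
            # Each word carries start/end times and a probability we treat as confidence.
            if w["probability"] >= min_word_prob:
                words.append({"word": w["word"].strip(),
                              "start": w["start"],
                              "end": w["end"],
                              "confidence": w["probability"]})

    return {
        "audio": audio_path,
        "language": result["language"],   # can seed language/accent tags
        "text": result["text"].strip(),   # utterance-level transcription
        "words": words,                   # word-level labels for ASR training
    }

if __name__ == "__main__":
    labels = pseudo_label("example.wav")
    print(f"{len(labels['words'])} confident words, language={labels['language']}")
```

Filtering on per-word confidence is one simple way to keep pseudo-labels from reinforcing recognition errors; the paper's own system additionally produces phone-level timestamps and scenario/accent/topic tags, which are not shown here.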
Size: 9.71 MB
Language: en
Added: Jul 19, 2024
Slides: 27 pages
Slide Content
1. Harnessing Wild and Untamed (Publicly Available) Data for the Cost-Efficient Development of Customized Speech-Related Services
Wei Chu, Ph.D., Co-Founder and CEO, Olewave ([email protected]). Olewave is a dataset and data solution provider, and a speech SDK and service provider, registered in California and headquartered in San Francisco. Our exhibition booth number is E10.
"Wild and untamed data": large-scale, publicly available, not human-labeled data.
Reference: "Harnessing Wild and Untamed Data for the Cost-Efficient Development of Customized Speech-Related Services", ICASSP 2024.
2. About Us
Olewave is a dataset and data solution provider: we deliver customized, labeled, and validated large-scale real-world text/video/image/speech datasets covering various scenarios and topics in different languages. We offer customized solutions for private data labeling, eliminating the need for human labelers and minimizing the risk of data breaches.
Olewave is also a speech SDK and service provider: with our avant-garde Tycho SDK (Software Development Kit) and comprehensive supporting services, we empower you to build your in-house speech-related products efficiently and cost-effectively.
Registered in California, US, and headquartered in San Francisco.
3. Outline: 1. Dataset and data solution; 2. Tycho SDK and supporting service.
4. 1. Dataset and data solution; 2. Tycho SDK and supporting service.
5. Front runners of generative AI are hungry for large-scale wild data ("wild data" here means publicly available data).
WSJ to OpenAI CTO Murati: "What did you use to train Sora?" [2]
Google pays $60M/yr to train models on Reddit's posts [1]; Reddit's net income was previously less than $20M/yr.
OpenAI used YouTube videos to train its Whisper, a large ASR model, according to NYT's source [3].
[1] Reuters, Feb 21, 2024. [2] WSJ, Mar 13, 2024. [3] NYT, Apr 8, 2024.
6. Increasing exposure to out-of-domain data during training might negatively impact accuracy within the specific domain.
Year | Work | Model | Training data source | Labeled (hrs) | Unlabeled (hrs) | Params | SpeechStew (En) WER [1] (%) | Fleurs (En) WER (%) | Fleurs (62 langs) avg WER (%) | Common Voice (En/De/Es/Fr) avg WER (%)
2021 | Google SpeechStew | E: Transformer, D: Transformer | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
Legend: open-sourced works are bolded on the slide; labeling methods are weakly-/pseudo-/ASR-transcribed or human-transcribed; WER numbers are color-coded on the slide (unfinetuned, e.g. 10.5; finetuned, e.g. 7.0).
The unlabeled data is LibriLight, which contains only audiobooks (read speech), while the SpeechStew evaluation sets contain conversations (spontaneous speech). This distribution shift in data could be the cause of the drop in accuracy.
Conclusion 1: Avoid out-of-domain data if you have enough good in-domain data.
[1] WER: Word Error Rate, reported by Google.
7. Increasing exposure to out-of-domain data during training might negatively impact accuracy within the specific domain. (Columns and legend as on slide 6.)
2021 | Google SpeechStew | E: Transformer, D: Transformer | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
2022 | OpenAI Whisper v3 | E: Trans/Conformer, D: Transformer | Mainly YouTube [2] | 5M | None | 1.6B | 11.5 | 4.1 | 17.6 | 8.1
2023 | Google USM | - | Mainly YouTube | 0.2M | 12M | 2.0B | 10.5 | - | 12.5 | -
SpeechStew is trained only with English data, while USM and Whisper are both trained with ~1000x more data from ~100 languages. Although their models are 10x larger, USM's and Whisper's WERs on English-only evaluations rose ~50% compared with SpeechStew's 0.1B model.
Conclusion 1: Avoid out-of-domain data if you have enough good in-domain data.
[1] WER: Word Error Rate, reported by Google. [2] "How Tech Giants Cut Corners to Harvest Data for A.I.", NYT.
8. A model trained with well-labeled in-domain data can outperform models trained with more out-of-domain data. (Columns and legend as on slide 6; Common Voice column reported by Nvidia [1].)
2021 | Google SpeechStew | E: Transformer, D: Transformer | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
2022 | OpenAI Whisper v3 | E: Trans/Conformer, D: Transformer | Mainly YouTube | 5M | None | 1.6B | 11.5 | 4.1 | 17.6 | 8.1
2023 | Google USM | - | Mainly YouTube | 0.2M | 12M | 2.0B | 10.5 | - | 12.5 | -
2024 | Nvidia Canary | - | Free + in-house | 85K | None | 1.0B | - | - | - | 5.8
Canary, trained with only in-domain-language data, roughly 100x less than the other models, reported a ~10-30% lower WER. USM reported using only 90K hours of labeled data from 75 languages and 3.5K hours of labeled English data, so it very likely has much less in-domain data (<15K hours) than Canary (85K hours). However, labeling Canary's training data with humans could exceed $5 million USD in cost.
Conclusion 1: Avoid out-of-domain data if you have enough good in-domain data.
[1] Reported by Nvidia.
9. The training data can be unlabeled or pseudo-labeled, which removes the challenges of recruiting, managing, and quality-assuring human labelers across various languages. (Columns and legend as on slide 6; Fleurs 62-language column reported by Google [1].)
2021 | Google SpeechStew | E: Conformer/Transformer, D: Transformer/RNN-T/CTC | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
2022 | OpenAI Whisper v3 | - | Mainly YouTube | 5M | None | 1.6B | 11.5 | 4.1 | 17.6 | 8.1
2023 | Google USM | - | Mainly YouTube | 0.2M | 12M | 2.0B | 10.5/7.0 | - | 12.5 | -
2023 | MetaAI SeamlessM4T | - | Web data [2] | 350K | 4.5M | 2.3B | - | 7.4 | Uncounted [7] | 6.5
2024 | AssemblyAI Universal-1 | - | Undisclosed | 1.6M | 12.5M | 1.0B | - | - | - | -
Universal-1 and Canary reported that they and Whisper all achieve WERs below 10% on German, Spanish, and French evaluations. Gemini Ultra reported a WER of 7.6% on the Fleurs (62-language) evaluation.
Conclusion 2: Human-/well-labeled data are not necessary when you have millions of hours of data.
[1] Reported by Google. [2] "A publicly available repo of crawled web data." [7] MetaAI only reported WERs of individual languages.
10. These results show no clear advantage of human-labeled data over pseudo-labeled/non-human-labeled data in SFT (we have some results on that, shown in the following slides). (Columns and legend as on slide 6.)
2021 | Google SpeechStew | E: Conformer/Transformer, D: Transformer | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
2022 | OpenAI Whisper v3 | - | Mainly YouTube | 5M | None | 1.6B | 11.5 | 4.1 | 17.6 | 8.1
2023 | Google USM | - | Mainly YouTube | 0.2M | 12M | 2.0B | 10.5/7.0 | - | 12.5/11.8 | -
SpeechStew has 6K hours of well-labeled in-domain data, and SFT resulted in a ~30% improvement in WER; Fleurs has 10 hours of labeled data per language, and SFT lowered the WER by ~10%. The variance in the improvement could stem from the quantity of data used in SFT, not the labeling method.
Conclusion 3: Run Supervised Fine-Tuning (SFT) when using a large ASR model for a specific task.
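Conclusion 3 recommends supervised fine-tuning of a large ASR model on task-specific data. As a toy illustration only (not the recipe used in the paper), here is a single SFT gradient step on one (audio, transcript) pair with Hugging Face transformers; a real run would add batching, a padding data collator, and evaluation.

```python
# Toy sketch of supervised fine-tuning (SFT) of an open Whisper checkpoint on
# in-domain (audio, transcript) pairs -- a single-step illustration, not a full recipe.
# Assumes Hugging Face `transformers` and `torch`; dataset loading is omitted.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def sft_step(waveform, transcript):
    """One gradient step on a single (16 kHz waveform, reference transcript) pair."""
    inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    # Whisper computes the cross-entropy loss internally when labels are provided.
    loss = model(input_features=inputs.input_features, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Random audio stands in for a real in-domain utterance in this sketch:
dummy_audio = torch.randn(16000 * 5).numpy()   # 5 seconds at 16 kHz
print(sft_step(dummy_audio, "hello from the target domain"))
```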
11. When training a large ASR model with abundant pseudo-labeled/weakly-labeled data, is it still necessary to include unlabeled data? (Columns and legend as on slide 6.)
2021 | Google SpeechStew | E: Conformer/Transformer, D: Transformer | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
2022 | OpenAI Whisper v3 | - | Mainly YouTube | 5M | None | 1.6B | 11.5 | 4.1 | 17.6 | 8.1
2023 | Google USM | - | Mainly YouTube | 0.2M | 12M | 2.0B | 10.5 | - | 12.5 | -
2023 | MetaAI SeamlessM4T | - | Web data | 350K | 4.5M | 2.3B | - | 7.4 | Uncounted [7] | 6.5
2024 | AssemblyAI Universal-1 | - | Undisclosed | 1.6M | 12.5M | 0.6B | - | - [4] | - | - [4]
USM has 10x less labeled data than Whisper, and it reported a ~10-30% reduction in WERs without fine-tuning. SeamlessM4T is better than Whisper on Common Voice. Whisper does very well on Fleurs (English only); Universal-1 reported better accuracy than Whisper on other English and non-English evaluations.
Conclusion 4: It is too early to determine the optimal mix ratio of pseudo-labeled and unlabeled data.
[4] Claimed better than Canary and Whisper. [7] MetaAI only reported WERs of individual languages.
12. Conclusions on using wild data for training/fine-tuning a large ASR model
Year | Work | Model | Training data source | Labeled (hrs) | Unlabeled (hrs) | Params | SpeechStew (En) WER [1] (%) | Fleurs (En) WER (%) | Fleurs (62 langs) [2] avg WER (%) | Common Voice (En/De/Es/Fr) avg WER (%) [3]
2021 | Google SpeechStew | E: Conformer/Transformer, D: Transformer/RNN-T/CTC | Free + licensed | 6K | None | 0.1B | 7.6 | - | n/a | n/a
2022 | OpenAI Whisper | Transformer E2E model | - | 6K | 60K | 1.0B | 7.8 | - | n/a | n/a
2022 | OpenAI Whisper v3 | - | Mainly YouTube [4] | 5M | None | 1.6B | 11.5 | 4.1 | 17.6 | 8.1
2023 | Google USM | - | Mainly YouTube | 0.2M | 12M | 2.0B | 10.5/7.0 | - | 12.5/11.8 | -
2023 | MetaAI SeamlessM4T | - | Web data [5] | 350K | 4.5M | 2.3B | - | 7.4 | Uncounted [6] | 6.5
2024 | CMU's OWSM v3.1 | - | Free + licensed | 180K | None | 1.0B | - | 9.0 | - | >12.6 (estimated [7])
2024 | Nvidia Canary | - | Free + in-house | 85K | None | 1.0B | - | - | - | 5.8
2024 | AssemblyAI Universal-1 | - | Undisclosed | 1.6M | 12.5M | 0.6B | - | - [8] | - | - [8]
Legend: open-sourced works are bolded on the slide; labeling methods are weakly-/pseudo-/ASR-transcribed or human-transcribed; WER numbers are color-coded on the slide (unfinetuned, e.g. 10.5; finetuned, e.g. 7.0).
Conclusion 1: Avoid out-of-domain data if you have enough good in-domain data.
Conclusion 2: Human-/well-labeled data are not necessary when you have millions of hours of data.
Conclusion 3: Run Supervised Fine-Tuning (SFT) when using a large ASR model for a specific task.
Conclusion 4: It is too early to determine the optimal mix ratio of pseudo-labeled and unlabeled data.
[1] WER: Word Error Rate, reported by Google. [2] Reported by Google. [3] Reported by Nvidia. [4] "How Tech Giants Cut Corners to Harvest Data for A.I.", NYT. [5] "A publicly available repo of crawled web data." [6] MetaAI only reported WERs of all 100 individual languages. [7] Reported En-only WER is 12.6. [8] Claimed better than Canary and Whisper.
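For readers unfamiliar with the metric compared throughout these tables: WER counts word substitutions, deletions, and insertions against a reference transcript, normalized by the reference length. A minimal example using the open-source jiwer package (the sentences below are made up for illustration):

```python
# Word Error Rate (WER) = (substitutions + deletions + insertions) / reference word count.
# Minimal example with the open-source `jiwer` package: pip install jiwer
import jiwer

reference  = "avoid out of domain data if you have enough in domain data"
hypothesis = "avoid out of the domain data if you have in domain data"

wer = jiwer.wer(reference, hypothesis)
print(f"WER = {wer:.1%}")   # 1 insertion + 1 deletion over 12 reference words ~= 16.7%
```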
13. Why is wild data more cost-effective than other types of data for fine-tuning speech models?
14. Free data and off-the-shelf data from vendors
Definitions: Read data: recordings of read prompts. Simulated data: recordings in a simulated scenario on a given topic. Private data: the client's own data, not the vendor's. Synthetic data: generated by AI. Web data: publicly available data.
Columns, data features: (a) Is it possible to get enough in-domain data? (b) Are transcriptions validated and of good quality? (c) Are the data from real-world scenarios? Data costs: (d) data unit cost; (e) extra cost: interaction with the vendor; (f) extra cost: time incurred by data delivery; (g) your development cost: engineering cost incurred by the data.
Dataset source | (a) | (b) | (c) | (d) | (e) | (f) | (g)
Free | 😐 | 😐 | 😐 | - | - | - | $$
Off-the-shelf data (read & simulated) | 😐 | 😐 | 😐 | $ | - | - | $
15. Customized data from traditional vendors (definitions and columns as on slide 14)
Dataset source | (a) | (b) | (c) | (d) | (e) | (f) | (g)
Free | 😐 | 😐 | 😐 | - | - | - | $$
Off-the-shelf data (read & simulated) | 😐 | 😐 | 😐 | $ | - | - | $
Customized read data | 😊 | 😐 | 😐 | $$ | $$ | $$ | $
Customized simulated data | 😊 | 😐 | 😊 | $$$ | $$ | $$$ | $
16. Your/your client's private data (cumulative table continued from slide 15; new row:)
Own private data | 😊 | 😐 | 😊 | $$$ | $$$ | $$ | $$
17. DIY synthetic data and wild data (cumulative table continued from slide 16; new rows:)
Synthetic data, DIY | 😊 | 😊 | 😖/😐 | - | - | - | $$-$$$
Wild data, DIY | 😊 | 😖/😐 | 😊 | - | - | - | $$-$$$
18. Wild data offered by Olewave (cumulative table continued from slide 17; new row:)
Wild data by Olewave | 😊 | 😐/😊 | 😊 | $ | - | - | $
19. Speech datasets with transcriptions: 100 hours for only $1999 (ICASSP 2024 special rate; regular price $2249)*
Features: large-scale, up to millions of hours; collected from real-world scenarios; with validated written/verbatim transcriptions; multiple languages and regions; various topics and scenarios.
Customization: besides ASR, do you need other speech datasets for meeting summarization, speech synthesis and voice cloning, or not only training but also evaluation?
* The price is for US English and general topics only. Delivery fee is not included. The special rate is for ICASSP 2024.
20. Datasets for generative AI and LLMs
Video datasets for Sora-like generative AI: 100 hours for only $4999 (ICASSP 2024 special rate; regular price $5625)*. Most resolutions are 1080p or above. Each video comes with textual information: caption ("… their passionate love affair becomes a thrilling race for survival …"); transcription of the audio, optional ("You jump, I jump, right? [panting] [music]"); metadata such as comments ("To be honest, this scene is the true definition of love."); and more.
Text-and-image interleaved datasets for GPT-4V-like multimodal LLMs: 100K articles with images for only $2999 (ICASSP 2024 special rate; regular price $3749)*. Support multiple languages and topics ("healthcare articles in Spanish"); periodically deliver updated documents ("all articles published in Q1 2024"); provide the source of each article ("https://leginfo.legislature.ca.gov/…/GENERAL%20PROVISIONS").
* The price is for US English and general topics only. Delivery fee is not included. The special rate is for ICASSP 2024.
21. Olewave's bespoke data solutions for private data labeling (a sketch of this labeling loop follows below):
Jump Start: we tailor a dataset of public data with labels in the client's domain and use our Tycho SDK to fine-tune a large (speech) model, then deliver both the customized dataset and the refined model to the client's workspace.
Auto Label: automatically label the client's private data with the fine-tuned model via the Tycho SDK in the client's workspace. This omits human labelers, reducing labeling costs and minimizing the risk of data breaches.
Auto Iterate: automatically curate a dataset comprising labeled private and public data for additional fine-tuning, then iteratively leverage the refined model to generate enhanced labels for the private data.
Launch: once iteration is complete, use the well-labeled data for your product development, with or without our Tycho SDK. We provide licensing and services for the Tycho SDK so you can label your data and build your products.
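To make the four-step workflow concrete, here is an illustrative, toolkit-agnostic sketch of the iterative labeling loop. It is not Tycho SDK code; `fine_tune`, `pseudo_label`, the `Labeled` record, and the 0.8 confidence threshold are hypothetical placeholders for whatever ASR stack you use.

```python
# Illustrative sketch of the Jump Start -> Auto Label -> Auto Iterate -> Launch loop
# described above (NOT the Tycho SDK). The callables are supplied by your own toolkit.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Labeled:
    audio: str         # path to an audio file
    text: str          # (pseudo-)transcription
    confidence: float  # model confidence in the transcription

def auto_iterate(model,
                 fine_tune: Callable,     # (model, list[Labeled]) -> model
                 pseudo_label: Callable,  # (model, audio_path) -> Labeled
                 public_seed: list,       # labeled public in-domain data (Jump Start)
                 private_audio: list,     # unlabeled client audio
                 rounds: int = 3,
                 keep_above: float = 0.8):
    """Alternate fine-tuning and confident pseudo-labeling of the private data."""
    model = fine_tune(model, public_seed)                                 # Jump Start
    accepted = []
    for _ in range(rounds):
        candidates = [pseudo_label(model, a) for a in private_audio]      # Auto Label
        # Keep only confident labels so errors do not reinforce themselves.
        accepted = [c for c in candidates if c.confidence >= keep_above]
        model = fine_tune(model, public_seed + accepted)                  # Auto Iterate
    return model, accepted                                                # Launch

# Dummy stand-ins so the sketch runs end to end:
if __name__ == "__main__":
    model, labels = auto_iterate(
        model=None,
        fine_tune=lambda m, data: m,
        pseudo_label=lambda m, a: Labeled(a, "placeholder text", 0.9),
        public_seed=[Labeled("seed.wav", "seed transcript", 1.0)],
        private_audio=["client_call_001.wav"],
    )
    print(len(labels), "pseudo-labeled private utterances")
```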
22. Wild data vs traditional data in fine-tuning
The cost-effectiveness of our solutions is shown on Switchboard, an English conversational speech recognition task. (We cannot share clients' experimental results, so this experiment is simulated.)
Training data source | How the data is labeled | WERR* (%) | Data cost
100-hr private data from the client § | No human labeling, ASR pseudo-labeling | -4.5 | ~$0
Off-the-shelf data in the client's domain ^ | Human labeling ¶ | 5.7 | ~$15K
100-hr private data from the client | Human labeling with flawless accuracy | 22.1 | ~$75K
100-hr private data from the client | Human labeling with reasonable accuracy ‡ | 9.5 | ~$25K
Private data from the client + 100-hr wild data from the client's domain by Olewave | Accurately inferred labels on public data + Tycho SDK-driven advanced pseudo-labeling on private data | 20.9 | $2.5K
* WERR: Word Error Rate Reduction.
§ Switchboard training data without labels is used to simulate the client's private data.
^ Simulates a typical domain-mismatch scenario by mixing 50-hr Fisher (in-domain) and 50-hr Librispeech (out-of-domain) data; the Fisher dataset is very close to Switchboard in data distribution.
¶ Simulated by using the dataset's transcriptions.
‡ Simulated by applying a typical human labeling error rate of 12.5%.
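The WERR column is presumably the relative WER reduction against the unadapted baseline model (the exact definition is an assumption here). A one-line helper makes the numbers easy to interpret; the 14.0% and 11.0% WERs below are made-up example values:

```python
# WERR (Word Error Rate Reduction) as used in the table above is presumably the
# relative WER reduction against the unadapted baseline; that definition is an
# assumption here. A negative WERR means fine-tuning made WER worse.
def werr(baseline_wer: float, finetuned_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - finetuned) / baseline."""
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer

# Example: a baseline WER of 14.0% reduced to 11.0% after fine-tuning
print(f"{werr(14.0, 11.0):.1f}% WERR")   # ~21.4%, comparable to the best rows above
```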
23. 1. Dataset and data solution; 2. Speech SDK: Tycho and supporting service.
24. What are the best-known methods of developing high Return-on-Investment (ROI) speech-related services?
25. Return-on-Investment (ROI) analysis of developing speech-related services
Customizable SDK options | State-of-the-art? | Service? | Cost? | ROI?
DIY from scratch | Depends | No | High | Low
GitHub/HF communities | Not really | No | Low | Low
NeMo, SeamlessM4T, … | Yes | No ^ | Low | Medium
Other solution providers § | Depends § | Yes | Depends | Low-medium
Tycho SDK by Olewave | Yes, and better * | Yes | Low-medium | Medium-high
^ Unless you are a big customer.
§ Some solution providers opt to develop their own in-house models rather than using state-of-the-art CC-BY large models, and they may lack the capability to acquire cost-efficient, in-domain data for their clients, unlike Olewave.
* We continuously integrate state-of-the-art features and models into our Tycho SDK.
26. Tycho: our SDK for building your high-ROI in-house speech services (ask us for the license fee and service charge)
Professional and trustworthy: with our cost-efficient datasets, we promptly deliver domain-specific models through fine-tuning on large models released for research and commercial use, or on Olewave's own fine-tuned models. We support the use of medium- or low-profile GPU workstations to further reduce cost (see the parameter-efficient fine-tuning sketch below). All training, fine-tuning, evaluation, and deployment code is available.
Competitive service rates for: top-notch fine-tuning of your own domain-specific model; fixing and improving your launched service through in-depth analysis of your bad cases; hands-on training for your ML engineers to work on speech R&D projects.
While Tycho currently focuses primarily on speech development, we plan to expand its capabilities to additional modalities in the future.
Note: we do not offer an API, but we can offer a demo system for you to try. Tycho is not an open-source SDK.
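One common way to fit fine-tuning of a large speech model onto a medium- or low-profile GPU, as mentioned above, is parameter-efficient fine-tuning with LoRA adapters. The sketch below uses Hugging Face peft on an open Whisper checkpoint and is only an illustration of the general technique, not necessarily what the Tycho SDK does internally.

```python
# One common way to fine-tune a large speech model on a modest GPU (not necessarily
# what the Tycho SDK does internally) is parameter-efficient fine-tuning with LoRA,
# which trains small adapter matrices instead of all model weights.
# Assumes Hugging Face `transformers` and `peft`.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])   # attention projections
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # typically well under 1% of the full model

# `model` can now be trained with the same kind of SFT loop sketched earlier,
# at a fraction of the GPU memory needed for full fine-tuning.
```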
27. Harness untamed data for high-ROI R&D
How can we assist you? Our exhibition booth number is E10, right next to the SPS lounge area. Feel free to stop by if you want to know more about our datasets, our private data labeling solutions, our Tycho SDK, and our services, or send an email to [email protected].
Olewave: registered in California and headquartered in San Francisco. https://www.olewave.com