Towards quality transcriptions of large limited domain archive data
jimregan
Talk given at Fonetik 2024, 4th of June
Towards quality transcriptions of large limited domain archive data
    Jim O’Regan, Jens Edlund
Why?
    General speech science goals: how do people speak?
    Path towards better training materials, iterative training processes
    Make large audio collections searchable
    SweTerror: multidisciplinary investigation into parliamentary discourse around the topic of terrorism
How much data?
    September 2012 to January 2022: 5925 hours of raw video
    January 2000 to September 2012: (a similar amount, currently unprocessed)
    1966 to 2000: TBC
How much data?
    September 2012 to January 2022: 5925 hours of raw video
    9372 video files
    2864 with no transcript
SweTerror motivation
    Parliamentary transcripts do not necessarily match what was spoken
    Speeches are filed in advance, but the speaker may react
    Transcription conventions
SweTerror motivation
    “terrorstämplade” (roughly: “terrorist-labelled”)
SweTerror motivation
    “Turkiets antiterr- så kallade antiterrorlagstiftning” (roughly: “Turkey's anti-terr- so-called anti-terrorism legislation”)
But wait! Didn’t OpenAI Whisper solve ASR?
Untrustworthy for our purposes
    Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of Machine Learning Research 202:28492-28518. Available from https://proceedings.mlr.press/v202/radford23a.html.
Disappearing “tack”
    2442207060018256921 1 13.96 0.199 tack 1.0 <eps> ins
    2442207060018256921 1 14.22 0.059 Herr 1.0 Herr cor
    2442207060018256921 1 14.36 0.38 talman! 1.0 talman! cor
    (The official transcripts only start with “Talman!”, “Herr talman!” or “Fru talman!”)
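The alignment rows on this and later slides can be read mechanically. The sketch below is a minimal parser, assuming each row has exactly eight whitespace-separated fields. The field names are guesses inferred from the rows themselves (recording id, channel, start time, duration, ASR word, confidence, official transcript word, edit label); the slides do not name the columns.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    # All field names are assumptions; the slides show the rows without a header.
    recording_id: str
    channel: str
    start: float           # seconds from the start of the recording
    duration: float        # seconds
    asr_word: str          # word with timing, presumably from the recogniser
    confidence: float
    transcript_word: str   # presumably the official transcript; <eps> means "nothing"
    edit: str              # cor / sub / ins / del, as printed on the slides


def parse_row(line: str) -> AlignedWord:
    """Parse one eight-field alignment row into an AlignedWord."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    rec, chan, start, dur, asr, conf, ref, edit = fields
    return AlignedWord(rec, chan, float(start), float(dur), asr, float(conf), ref, edit)


SAMPLE = """
2442207060018256921 1 13.96 0.199 tack 1.0 <eps> ins
2442207060018256921 1 14.22 0.059 Herr 1.0 Herr cor
2442207060018256921 1 14.36 0.38 talman! 1.0 talman! cor
"""

rows = [parse_row(line) for line in SAMPLE.strip().splitlines()]
# Words that were recognised but have no counterpart in the transcript.
extra_words = [r.asr_word for r in rows if r.transcript_word == "<eps>"]
```

With the sample above, `extra_words` holds the spoken “tack” that the official transcript leaves out.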
Curious insertions
    00:00.000 --> 00:30.000 Tack till mina supporters via www.patreon.com
    00:30.000 --> 00:34.000 Tack till mina supporters via www.patreon.com
    01:00.000 --> 01:04.000 Tack till mina supporters via www.patreon.com
    01:30.000 --> 01:34.000 Tack till mina supporters via www.patreon.com
    02:00.000 --> 02:04.000 Tack till mina supporters via www.patreon.com
    02:30.000 --> 02:34.000 Tack till mina supporters via www.patreon.com
    03:00.000 --> 03:04.000 Tack till mina supporters via www.patreon.com
    03:30.000 --> 03:34.000 Tack till mina supporters via www.patreon.com
    (“Tack till mina supporters via www.patreon.com”: “Thanks to my supporters via www.patreon.com”)
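Insertions like these are easy to spot because the same text recurs across many cues. As an illustration only (the talk does not describe any such filter), a minimal sketch that counts repeated cue texts might look like this; `repeated_cues` and its `min_repeats` threshold are hypothetical names and values.

```python
from collections import Counter

# Cues as (start, end, text) tuples; the first four texts are copied from the slide,
# the last cue is an invented contrast example.
CUES = [
    ("00:00.000", "00:30.000", "Tack till mina supporters via www.patreon.com"),
    ("00:30.000", "00:34.000", "Tack till mina supporters via www.patreon.com"),
    ("01:00.000", "01:04.000", "Tack till mina supporters via www.patreon.com"),
    ("01:30.000", "01:34.000", "Tack till mina supporters via www.patreon.com"),
    ("01:34.000", "01:36.000", "Herr talman!"),
]


def repeated_cues(cues, min_repeats=3):
    """Return {text: count} for cue texts that occur at least min_repeats times."""
    counts = Counter(text.strip().lower() for _, _, text in cues)
    return {text: n for text, n in counts.items() if n >= min_repeats}


suspicious = repeated_cues(CUES)
```

A repeat count alone cannot tell a hallucination from a genuinely repeated phrase, so a result like this is a prompt for manual inspection rather than a verdict.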
Alternative: KBLab’s VoxRex
    Common Voice WER (Word Error Rate)
    VoxRex: 8.49
    Whisper large-v3: 8.3
    Whisper large-v2: 10.1
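For readers unfamiliar with the metric: WER is the word-level edit distance between the reference and the recogniser output (substitutions, deletions and insertions), divided by the number of reference words. The figures above are presumably percentages. The sketch below is a generic textbook implementation, not the scoring tool used to produce those figures, and it applies no text normalisation.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion: reference word missing from output
                cur[j - 1] + 1,          # insertion: extra word in output
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = cur
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction of the reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One inserted word against a two-word reference.
example = wer("herr talman", "tack herr talman")
```

Because insertions count as errors, WER can exceed 1.0 when the output is much longer than the reference.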
VoxRex errors are opportunities
False starts
    2442207180019978121 1 1619.5 1.579 oppooppositionens 1.0 oppositionens sub
    2442207150019764521 1 213.3 0.839 bberor 1.0 beror sub
    2442207160019915621 1 492.7 2.0 globalisglobaliseringen 1.0 globaliseringen sub
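All three examples share a pattern: the recognised token is the transcript word preceded by an aborted fragment of that same word. A minimal heuristic for finding such cases, offered as an assumption rather than the authors' method, checks that the token ends with the transcript word and that the leftover prefix is itself a prefix of that word. The function name is hypothetical.

```python
def false_start_fragment(asr_word: str, transcript_word: str):
    """Return the aborted fragment if asr_word == fragment + transcript_word
    and the fragment is a prefix of transcript_word; otherwise return None."""
    if not transcript_word or len(asr_word) <= len(transcript_word):
        return None
    if not asr_word.endswith(transcript_word):
        return None
    fragment = asr_word[: -len(transcript_word)]
    return fragment if transcript_word.startswith(fragment) else None


# Word pairs taken from the slides.
pairs = [
    ("oppooppositionens", "oppositionens"),
    ("bberor", "beror"),
    ("globalisglobaliseringen", "globaliseringen"),
    ("resvasion", "reservation"),  # alternative pronunciation, not a false start
]
fragments = [false_start_fragment(a, t) for a, t in pairs]
```

The check is deliberately strict, so a restart that differs from the final word by even one character would be missed.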
Alternative pronunciations
    2442207160019915621 1 398.3 0.459 resvasion 1.0 reservation sub
    2442207160019915621 1 432.52 0.48 resovationen 1.0 reservationen sub
Filled pauses
    2442207150019764521 1 622.44 0.159 ifrån 1.0 från sub
    2442207180020109821 1 326.1 0.099 nhär 1.0 här ins
Solution: use the intersection
    As VoxRex and Whisper are based on completely different principles, if they match, it’s a good indication that this was what was spoken
    Treat Whisper similarly to the official transcripts
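The slides do not say how the intersection is computed. One plausible way to sketch it, assuming both outputs are already tokenised into comparable word lists, is to keep the runs of words on which the two systems agree, here using `difflib.SequenceMatcher` from the Python standard library. The word lists and the function name below are invented for illustration.

```python
from difflib import SequenceMatcher


def agreed_runs(voxrex_words, whisper_words):
    """Return the runs of consecutive words on which both outputs agree."""
    matcher = SequenceMatcher(a=voxrex_words, b=whisper_words, autojunk=False)
    runs = []
    for block in matcher.get_matching_blocks():
        if block.size > 0:  # the final block is always a zero-length sentinel
            runs.append(voxrex_words[block.a : block.a + block.size])
    return runs


# Illustrative word lists, not real system output.
voxrex = ["tack", "herr", "talman", "jag", "har", "flera", "kollegor", "här", "i", "kammaren"]
whisper = ["herr", "talman", "jag", "har", "flera", "kollegor", "i", "kammaren"]

runs = agreed_runs(voxrex, whisper)
```

In this example the disputed words “tack” and “här” fall outside the agreed runs, which is the behaviour the slide argues for: only text confirmed by both systems is trusted.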
This work: get contiguous segments for ASR
    Three data sets:
    100 hours of clean matches
    200 hours of clean matches
    49 hours of noisy matches (using text normalisation)
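Getting contiguous segments from word-level alignments can be sketched as grouping consecutive rows labelled `cor` and keeping runs above a minimum length. The sketch below makes several assumptions: all rows come from a single recording, a segment ends at the last word's start time plus its duration, and `min_words` is a hypothetical threshold. The actual selection criteria are not given in the slides.

```python
from typing import NamedTuple


class Word(NamedTuple):
    start: float     # seconds
    duration: float  # seconds
    text: str
    edit: str        # cor / sub / ins / del, as on the slides


def clean_segments(words, min_words=2):
    """Group consecutive 'cor' words into (start, end, text) segments."""
    segments, run = [], []

    def flush():
        # Close the current run, keeping it only if it is long enough.
        if len(run) >= min_words:
            end = run[-1].start + run[-1].duration
            segments.append((run[0].start, end, " ".join(w.text for w in run)))
        run.clear()

    for word in words:
        if word.edit == "cor":
            run.append(word)
        else:
            flush()  # any mismatch ends the current clean run
    flush()
    return segments


# A few rows adapted from the slides (one recording).
words = [
    Word(2324.96, 0.039, "vi", "cor"),
    Word(2325.1, 0.32, "måste", "ins"),
    Word(2326.48, 0.5, "historien", "ins"),
    Word(2327.14, 0.34, "gång", "cor"),
    Word(2327.58, 0.039, "på", "cor"),
]
segments = clean_segments(words)
```

Here the isolated “vi” is dropped because it is shorter than `min_words`, and the one surviving segment covers “gång på”.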
Results
Future work More recent past work
    Extracting sentences
    (Quasi-)Phoneme-based recognition
Ongoing work
    More complex phrases
Some things were never actually said
    2019-04-09
    2442203180006309721 1 230.46 0.06 Jag 1.0 Jag cor
    2442203180006309721 1 230.6 0.08 har 1.0 har cor
    2442203180006309721 1 230.76 0.2 flera 1.0 flera cor
    2442203180006309721 1 231.04 0.3 kollegor 1.0 kollegor ins
    2442203180006309721 1 231.4 0.159 här 1.0 <eps> sub
    2442203180006309721 1 231.76 0.02 i 1.0 i cor
    2442203180006309721 1 232.08 0.399 kammaren 1.0 kammaren cor
    2442203180006309721 1 232.479 0.0 <eps> 1.0 som del
    2442203180006309721 1 232.479 0.0 <eps> 1.0 inte del
    2442203180006309721 1 232.479 0.0 <eps> 1.0 kommer del
    2442203180006309721 1 232.479 0.0 <eps> 1.0 från del
    2442203180006309721 1 232.479 0.0 <eps> 1.0 Stockholm. del
Phrases move
    2019-04-09
    2442203180006309721 1 1596.22 0.099 ett 1.0 ett cor
    2442203180006309721 1 1596.38 0.199 lopp 1.0 lopp cor
    2442203180006309721 1 1596.579 0.0 <eps> 1.0 på del
    2442203180006309721 1 1596.579 0.0 <eps> 1.0 60 del
    2442203180006309721 1 1596.579 0.0 <eps> 1.0 mil del
    2442203180006309721 1 1596.72 0.16 över 1.0 över cor
    2442203180006309721 1 1597.0 0.119 tre 1.0 tre cor
    2442203180006309721 1 1597.32 0.279 dagar 1.0 <eps> ins
    2442203180006309721 1 1597.7 0.059 på 1.0 <eps> ins
    2442203180006309721 1 1597.9 0.379 sextio 1.0 <eps> ins
    2442203180006309721 1 1598.36 0.32 mil 1.0 dagar. sub
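The rows above also show why text normalisation matters before matching: the transcript writes “60” and “dagar.” where the recogniser produces “sextio” and “dagar”. The sketch below is a toy normaliser, assuming lowercasing, stripping surrounding punctuation, and a small hand-written table of Swedish tens. It is not the normalisation used in the talk, and a real system would need a full number verbaliser.

```python
import string

# Hypothetical, deliberately tiny lookup table covering only the round tens.
TENS = {
    "10": "tio", "20": "tjugo", "30": "trettio", "40": "fyrtio", "50": "femtio",
    "60": "sextio", "70": "sjuttio", "80": "åttio", "90": "nittio",
}


def normalise(text: str):
    """Lowercase, strip surrounding punctuation, and spell out known numbers."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that were punctuation only
            tokens.append(TENS.get(token, token))
    return tokens


transcript_tokens = normalise("på 60 mil över tre dagar.")
asr_tokens = normalise("över tre dagar på sextio mil")
```

After normalisation the two versions contain the same words in a different order, which is the “phrases move” effect the slide illustrates.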
Things are added in the moment
    2019-04-09
    2442203180006309721 1 2324.96 0.039 vi 1.0 vi cor
    2442203180006309721 1 2325.1 0.32 måste 1.0 <eps> ins
    2442203180006309721 1 2325.52 0.159 höra 1.0 <eps> ins
    2442203180006309721 1 2325.76 0.319 talas 1.0 <eps> ins
    2442203180006309721 1 2326.12 0.039 om 1.0 <eps> ins
    2442203180006309721 1 2326.22 0.08 den 1.0 <eps> ins
    2442203180006309721 1 2326.34 0.099 här 1.0 <eps> ins
    2442203180006309721 1 2326.48 0.5 historien 1.0 <eps> ins
    2442203180006309721 1 2327.14 0.34 gång 1.0 gång cor
    2442203180006309721 1 2327.58 0.039 på 1.0 på cor