Towards quality transcriptions of large limited domain archive data


About This Presentation

Towards quality transcriptions of large limited domain archive data

Talk given at Fonetik 2024, 4th of June


Slide Content

Towards quality transcriptions of large limited domain archive data
Jim O’Regan, Jens Edlund

Why?
General speech science goals: how do people speak?
Path towards better training materials, iterative training processes
Make large audio collections searchable
SweTerror: multidisciplinary investigation into parliamentary discourse around the topic of terrorism

How much data?
September 2012 to January 2022: 5925 hours of raw video (9372 video files, 2864 with no transcript)
January 2000 to September 2012: a similar amount, currently unprocessed
1966 to 2000: TBC

SweTerror motivation
Parliamentary transcripts do not necessarily match what was spoken
Speeches are filed in advance, but the speaker may react
Transcription conventions

SweTerror motivation
“terrorstämplade” (“terror-stamped”, i.e. designated as terrorist)

SweTerror motivation
“Turkiets antiterr- så kallade antiterrorlagstiftning” (“Turkey’s antiterr- so-called anti-terror legislation”)

But wait! Didn’t OpenAI Whisper solve ASR?

Untrustworthy for our purposes
Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2023). Robust Speech Recognition via Large-Scale Weak Supervision. Proceedings of Machine Learning Research 202:28492-28518. Available from https://proceedings.mlr.press/v202/radford23a.html

Disappearing “tack”
2442207060018256921 1 13.96 0.199 tack 1.0 <eps> ins
2442207060018256921 1 14.22 0.059 Herr 1.0 Herr cor
2442207060018256921 1 14.36 0.38 talman ! 1.0 talman ! cor
(The official transcripts only start with “Talman!”, “Herr talman!” or “Fru talman!”)
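The alignment lines above are plain whitespace-separated records. A minimal sketch (an assumption, not the project's tooling) of how such a file could be scanned for leading words, like the dropped “tack”, that the hypothesis contains but the official transcript omits, assuming fields of utterance id, channel, start, duration, hypothesis word, confidence, reference word and edit operation:

from collections import defaultdict

def leading_insertions(path):
    """Return {utterance id: [inserted words]} for insertions that occur
    before the first correctly matched word of each utterance."""
    leading = defaultdict(list)
    matched = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue  # skip blank or oddly tokenised lines
            utt_id, hyp, op = fields[0], fields[4], fields[-1]
            if utt_id in matched:
                continue  # only the start of the utterance is of interest
            if op == "ins":
                leading[utt_id].append(hyp)
            elif op == "cor":
                matched.add(utt_id)
    return dict(leading)

# Hypothetical usage; "alignment.txt" is a placeholder file name:
# for utt, words in leading_insertions("alignment.txt").items():
#     print(utt, words)  # e.g. 2442207060018256921 ['tack']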

Curious insertions
00:00.000 --> 00:30.000 Tack till mina supporters via www.patreon.com
00:30.000 --> 00:34.000 Tack till mina supporters via www.patreon.com
01:00.000 --> 01:04.000 Tack till mina supporters via www.patreon.com
01:30.000 --> 01:34.000 Tack till mina supporters via www.patreon.com
02:00.000 --> 02:04.000 Tack till mina supporters via www.patreon.com
02:30.000 --> 02:34.000 Tack till mina supporters via www.patreon.com
03:00.000 --> 03:04.000 Tack till mina supporters via www.patreon.com
03:30.000 --> 03:34.000 Tack till mina supporters via www.patreon.com
(“Thanks to my supporters via www.patreon.com”, inserted every 30 seconds)
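One simple guard against this kind of output is to count how often the same segment text recurs. The sketch below assumes Whisper is run via the openai-whisper package; the model size, file name and repeat threshold are placeholders, not values from the slides:

from collections import Counter

def suspicious_segments(segments, min_repeats=3):
    """segments: dicts with a "text" key, e.g. result["segments"] from
    whisper's transcribe(). Returns texts occurring min_repeats or more times."""
    counts = Counter(seg["text"].strip() for seg in segments)
    return {text: n for text, n in counts.items() if n >= min_repeats}

# Hypothetical usage:
# import whisper
# model = whisper.load_model("large-v3")
# result = model.transcribe("session.wav", language="sv")
# print(suspicious_segments(result["segments"]))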

Alternative: KBLab’s VoxRex
Common Voice WER (Word Error Rate):
VoxRex: 8.49
Whisper large-v3: 8.3
Whisper large-v2: 10.1
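For reference, a word error rate figure like those above is the number of substitutions, insertions and deletions divided by the number of reference words; it can be computed with, for example, the jiwer package (an assumption; the slides do not name the tool used). The toy sentences below are invented for illustration:

import jiwer

reference = "herr talman det här beror på flera saker"
hypothesis = "herr talman det hör beror på flera saker"

# One substitution in eight reference words: 12.50%
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")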

VoxRex errors are opportunities

False starts
2442207180019978121 1 1619.5 1.579 oppooppositionens 1.0 oppositionens sub
2442207150019764521 1 213.3 0.839 bberor 1.0 beror sub
2442207160019915621 1 492.7 2.0 globalisglobaliseringen 1.0 globaliseringen sub
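These substitutions follow a recognisable pattern: the hypothesis word is the reference word with part of itself repeated in front. A minimal heuristic along those lines, as an assumption rather than the project's actual classifier:

def looks_like_false_start(hyp: str, ref: str) -> bool:
    """True if hyp is ref prefixed with a repeated fragment of ref."""
    if hyp == ref or not hyp.endswith(ref):
        return False
    prefix = hyp[: len(hyp) - len(ref)]
    # The stranded fragment should itself be the start of the reference word.
    return ref.startswith(prefix)

assert looks_like_false_start("oppooppositionens", "oppositionens")
assert looks_like_false_start("bberor", "beror")
assert looks_like_false_start("globalisglobaliseringen", "globaliseringen")
assert not looks_like_false_start("ifrån", "från")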

Alternative pronunciations
2442207160019915621 1 398.3 0.459 resvasion 1.0 reservation sub
2442207160019915621 1 432.52 0.48 resovationen 1.0 reservationen sub

Filled pauses
2442207150019764521 1 622.44 0.159 ifrån 1.0 från sub
2442207180020109821 1 326.1 0.099 nhär 1.0 här ins

Solution: use the intersection
As VoxRex and Whisper are based on completely different principles, if they match, it is a good indication that this is what was spoken.
Treat Whisper similarly to the official transcripts.
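A minimal sketch of such an intersection, assuming both systems' outputs are available as plain text (the example sentences below are invented): keep only word spans on which the two hypotheses agree.

from difflib import SequenceMatcher

def matching_spans(voxrex_text: str, whisper_text: str, min_words: int = 5):
    """Yield contiguous word sequences that appear identically in both
    hypotheses and are at least min_words long."""
    a = voxrex_text.lower().split()
    b = whisper_text.lower().split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.size >= min_words:
            yield a[block.a : block.a + block.size]

voxrex = "herr talman det här är en viktig fråga för oss alla"
whisper = "herr talman det här är en viktig fråga för er alla"
for span in matching_spans(voxrex, whisper, min_words=4):
    print(" ".join(span))  # herr talman det här är en viktig fråga för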

This work: get contiguous segments for ASR
Three data sets:
100 hours of clean matches
200 hours of clean matches
49 hours of noisy matches (using text normalisation)

Results

Future work
More recent past work
Extracting sentences
(Quasi-)Phoneme-based recognition

Ongoing work
More complex phrases

Some things were never actually said (2019-04-09)
2442203180006309721 1 230.46 0.06 Jag 1.0 Jag cor
2442203180006309721 1 230.6 0.08 har 1.0 har cor
2442203180006309721 1 230.76 0.2 flera 1.0 flera cor
2442203180006309721 1 231.04 0.3 kollegor 1.0 kollegor ins
2442203180006309721 1 231.4 0.159 här 1.0 <eps> sub
2442203180006309721 1 231.76 0.02 i 1.0 i cor
2442203180006309721 1 232.08 0.399 kammaren 1.0 kammaren cor
2442203180006309721 1 232.479 0.0 <eps> 1.0 som del
2442203180006309721 1 232.479 0.0 <eps> 1.0 inte del
2442203180006309721 1 232.479 0.0 <eps> 1.0 kommer del
2442203180006309721 1 232.479 0.0 <eps> 1.0 från del
2442203180006309721 1 232.479 0.0 <eps> 1.0 Stockholm . del
(The official transcript adds “som inte kommer från Stockholm”, “who do not come from Stockholm”, which was not spoken.)

Phrases move (2019-04-09)
2442203180006309721 1 1596.22 0.099 ett 1.0 ett cor
2442203180006309721 1 1596.38 0.199 lopp 1.0 lopp cor
2442203180006309721 1 1596.579 0.0 <eps> 1.0 på del
2442203180006309721 1 1596.579 0.0 <eps> 1.0 60 del
2442203180006309721 1 1596.579 0.0 <eps> 1.0 mil del
2442203180006309721 1 1596.72 0.16 över 1.0 över cor
2442203180006309721 1 1597.0 0.119 tre 1.0 tre cor
2442203180006309721 1 1597.32 0.279 dagar 1.0 <eps> ins
2442203180006309721 1 1597.7 0.059 på 1.0 <eps> ins
2442203180006309721 1 1597.9 0.379 sextio 1.0 <eps> ins
2442203180006309721 1 1598.36 0.32 mil 1.0 dagar. sub
(Spoken: “ett lopp över tre dagar på sextio mil”; official transcript: “ett lopp på 60 mil över tre dagar”.)

Things are added in the moment (2019-04-09)
2442203180006309721 1 2324.96 0.039 vi 1.0 vi cor
2442203180006309721 1 2325.1 0.32 måste 1.0 <eps> ins
2442203180006309721 1 2325.52 0.159 höra 1.0 <eps> ins
2442203180006309721 1 2325.76 0.319 talas 1.0 <eps> ins
2442203180006309721 1 2326.12 0.039 om 1.0 <eps> ins
2442203180006309721 1 2326.22 0.08 den 1.0 <eps> ins
2442203180006309721 1 2326.34 0.099 här 1.0 <eps> ins
2442203180006309721 1 2326.48 0.5 historien 1.0 <eps> ins
2442203180006309721 1 2327.14 0.34 gång 1.0 gång cor
2442203180006309721 1 2327.58 0.039 på 1.0 på cor
(The spoken “måste höra talas om den här historien”, “must hear about this story”, is absent from the official transcript.)
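The three examples above all show up as runs of "del" or "ins" operations in the alignment. A minimal sketch for pulling such runs out, assuming the aligned words are available as (hypothesis, reference, operation) tuples in time order; the toy example is taken from the "Phrases move" slide:

from itertools import groupby

def runs_of(op, aligned_words):
    """Yield the word sequence for each maximal run of the given operation."""
    for key, group in groupby(aligned_words, key=lambda w: w[2]):
        if key == op:
            group = list(group)
            # For deletions report the reference side, otherwise the hypothesis.
            side = 1 if op == "del" else 0
            yield [w[side] for w in group]

example = [
    ("ett", "ett", "cor"), ("lopp", "lopp", "cor"),
    ("<eps>", "på", "del"), ("<eps>", "60", "del"), ("<eps>", "mil", "del"),
    ("över", "över", "cor"), ("tre", "tre", "cor"),
    ("dagar", "<eps>", "ins"), ("på", "<eps>", "ins"), ("sextio", "<eps>", "ins"),
]
print(list(runs_of("del", example)))  # [['på', '60', 'mil']]
print(list(runs_of("ins", example)))  # [['dagar', 'på', 'sextio']]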

Thank you!