Towards quality transcriptions of large limited domain archive data

Jim O’Regan, Jens Edlund

Why?
General speech science goals: how do people speak?
Path towards better training materials, iterative training processes
Make large audio collections searchable
SweTerror: multidisciplinary investigation into parliamentary discourse around the topic of terrorism

How much data?
September 2012 to January 2022: 5925 hours of raw video
January 2000 to September 2012: a similar amount, currently unprocessed
1966 to 2000: TBC

How much data?
September 2012 to January 2022: 5925 hours of raw video
9372 video files
2864 with no transcript

SweTerror motivation
Parliamentary transcripts do not necessarily match what was spoken
Speeches are filed in advance, but the speaker may react
Transcription conventions

SweTerror motivation
“terrorstämplade” (“branded as terrorist”)

SweTerror motivation
“Turkiets antiterr- så kallade antiterrorlagstiftning” (“Turkey's anti-terr- so-called anti-terror legislation”)

But wait! Didn’t OpenAI Whisper solve ASR?

Untrustworthy for our purposes
Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C. & Sutskever, I. (2023). Robust Speech Recognition via Large-Scale Weak Supervision, in Proceedings of Machine Learning Research 202:28492-28518. Available from https://proceedings.mlr.press/v202/radford23a.html.

Disappearing “tack”
2442207060018256921 1 13.96 0.199 tack 1.0 <eps> ins
2442207060018256921 1 14.22 0.059 Herr 1.0 Herr cor
2442207060018256921 1 14.36 0.38 talman! 1.0 talman! cor
(The official transcripts only start with “Talman!”, “Herr talman!” or “Fru talman!”)
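The alignment rows above look like Kaldi's ctm-edits layout (file id, channel, start, duration, hypothesis word, confidence, reference word, edit type); that reading is an assumption, as the slides do not name the format. Under that assumption, a minimal sketch of parsing the rows and listing the words the recogniser heard that the official transcript lacks could look as follows. The `CtmEdit` class and `parse_ctm_edits_line` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CtmEdit:
    """One row of an alignment file (column meaning is assumed, see lead-in)."""
    file_id: str
    channel: str
    start: float
    duration: float
    hyp: str          # word produced by the recogniser
    confidence: float
    ref: str          # word in the official transcript, or <eps>
    edit: str         # cor, sub, ins or del


def parse_ctm_edits_line(line: str) -> CtmEdit:
    """Split one whitespace-separated row into typed fields."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {line!r}")
    file_id, channel, start, duration, hyp, conf, ref, edit = fields
    return CtmEdit(file_id, channel, float(start), float(duration),
                   hyp, float(conf), ref, edit)


SAMPLE = """\
2442207060018256921 1 13.96 0.199 tack 1.0 <eps> ins
2442207060018256921 1 14.22 0.059 Herr 1.0 Herr cor
2442207060018256921 1 14.36 0.38 talman! 1.0 talman! cor
"""

rows = [parse_ctm_edits_line(line) for line in SAMPLE.splitlines()]
# Words heard in the audio but missing from the official transcript.
spoken_only = [row.hyp for row in rows if row.edit == "ins"]
print(spoken_only)
```

Keeping the rows as typed records makes later filtering by edit type or by time span straightforward.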

Curious insertions
00:00.000 --> 00:30.000 Tack till mina supporters via www.patreon.com
00:30.000 --> 00:34.000 Tack till mina supporters via www.patreon.com
01:00.000 --> 01:04.000 Tack till mina supporters via www.patreon.com
01:30.000 --> 01:34.000 Tack till mina supporters via www.patreon.com
02:00.000 --> 02:04.000 Tack till mina supporters via www.patreon.com
02:30.000 --> 02:34.000 Tack till mina supporters via www.patreon.com
03:00.000 --> 03:04.000 Tack till mina supporters via www.patreon.com
03:30.000 --> 03:34.000 Tack till mina supporters via www.patreon.com
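Repeated cues like these can be flagged mechanically. The sketch below is not part of the presented work: it assumes each cue has been flattened to a single line (timing followed by text, as on the slide), and the function name `repeated_cue_texts` and the threshold `min_repeats` are hypothetical.

```python
import re
from collections import Counter

# Timing line with optional hours, followed by the cue text on the same line.
CUE = re.compile(
    r"^(?:\d{2}:)?\d{2}:\d{2}\.\d{3} --> (?:\d{2}:)?\d{2}:\d{2}\.\d{3} (.+)$"
)


def repeated_cue_texts(lines, min_repeats=3):
    """Return cue texts that occur at least `min_repeats` times, with counts."""
    texts = []
    for line in lines:
        match = CUE.match(line.strip())
        if match:
            texts.append(match.group(1).strip())
    counts = Counter(texts)
    return {text: n for text, n in counts.items() if n >= min_repeats}


cues = [
    "00:00.000 --> 00:30.000 Tack till mina supporters via www.patreon.com",
    "00:30.000 --> 00:34.000 Tack till mina supporters via www.patreon.com",
    "00:34.000 --> 00:36.000 Herr talman!",
    "01:00.000 --> 01:04.000 Tack till mina supporters via www.patreon.com",
]
suspicious = repeated_cue_texts(cues)
print(suspicious)
```

A repeat count alone cannot prove that a cue is a hallucination; it only marks candidates for closer inspection.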

Alternative: KBLab’s VoxRex
Common Voice WER (Word Error Rate)
VoxRex: 8.49
Whisper large-v3: 8.3
Whisper large-v2: 10.1
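For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, insertions and deletions) between hypothesis and reference, divided by the number of reference words. The figures above are presumably percentages. A minimal sketch of the standard computation, with a hypothetical function name and made-up example sentences, not the evaluation script behind the reported numbers:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One inserted word ("tack") and one deleted word ("flera"): 2 edits, 6 words.
wer = word_error_rate("herr talman jag har flera kollegor",
                      "tack herr talman jag har kollegor")
print(f"{100 * wer:.1f}%")
```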

VoxRex errors are opportunities

False starts
2442207180019978121 1 1619.5 1.579 oppooppositionens 1.0 oppositionens sub
2442207150019764521 1 213.3 0.839 bberor 1.0 beror sub
2442207160019915621 1 492.7 2.0 globalisglobaliseringen 1.0 globaliseringen sub
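All three examples share a pattern: the recognised token is the reference word preceded by a restarted fragment of that same word. A hypothetical heuristic for spotting this pattern (an illustration only, not a method described in the slides) is sketched below; `false_start_prefix` is an invented name.

```python
from typing import Optional


def false_start_prefix(hyp: str, ref: str) -> Optional[str]:
    """Return the restarted fragment if `hyp` looks like fragment + `ref`,
    where the fragment is itself the beginning of `ref`; otherwise None."""
    if hyp == ref or not hyp.endswith(ref):
        return None
    fragment = hyp[: len(hyp) - len(ref)]
    return fragment if ref.startswith(fragment) else None


pairs = [
    ("oppooppositionens", "oppositionens"),
    ("bberor", "beror"),
    ("globalisglobaliseringen", "globaliseringen"),
    ("resvasion", "reservation"),   # alternative pronunciation, not a restart
    ("ifrån", "från"),              # extra sound that is not a prefix of the word
]
fragments = [false_start_prefix(hyp, ref) for hyp, ref in pairs]
print(fragments)
```

The check is deliberately strict: a leading sound that does not begin the reference word is left for other categories.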

Alternative pronunciations
2442207160019915621 1 398.3 0.459 resvasion 1.0 reservation sub
2442207160019915621 1 432.52 0.48 resovationen 1.0 reservationen sub

Filled pauses
2442207150019764521 1 622.44 0.159 ifrån 1.0 från sub
2442207180020109821 1 326.1 0.099 nhär 1.0 här ins

Solution: use the intersection
VoxRex and Whisper are based on completely different principles, so if their outputs match, it is a good indication that this is what was spoken
Treat Whisper similarly to the official transcripts
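One way to picture the intersection idea is to align the two word sequences and keep only the runs on which they agree. The sketch below uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner; the slides do not say which alignment tool was used, and `agreed_runs`, `min_len` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def agreed_runs(words_a, words_b, min_len=3):
    """Return runs of at least `min_len` consecutive words on which both
    recognisers agree, in the order they occur."""
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    return [
        words_a[block.a : block.a + block.size]
        for block in matcher.get_matching_blocks()
        if block.size >= min_len
    ]


# Made-up outputs from two systems for the same stretch of audio.
system_a = "tack herr talman jag har flera kollegor här i kammaren".split()
system_b = "herr talman jag har flera kollega här i kammaren".split()
runs = agreed_runs(system_a, system_b)
print(runs)
```

In practice the words would likely need case and punctuation normalisation before comparison, which this sketch leaves out.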

This work: get contiguous segments for ASR
Three data sets:
100 hours of clean matches
200 hours of clean matches
49 hours of noisy matches (using text normalisation)
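A contiguous segment can be thought of as a maximal run of aligned words that all match. The sketch below groups consecutive rows marked `cor` into segments with start and end times. The row layout (start, duration, word, edit type), the `min_words` threshold and the function name are assumptions; the actual selection criteria behind the three data sets are not given in the slides.

```python
def clean_segments(rows, min_words=2):
    """Group consecutive correctly matched words into (start, end, text).

    `rows` is a time-ordered iterable of (start, duration, word, edit) tuples.
    """
    segments = []
    current = []

    def flush():
        # Close the running segment if it is long enough, then reset it.
        if len(current) >= min_words:
            start = current[0][0]
            end = current[-1][0] + current[-1][1]
            text = " ".join(word for _, _, word, _ in current)
            segments.append((start, end, text))
        current.clear()

    for row in rows:
        if row[3] == "cor":
            current.append(row)
        else:
            flush()
    flush()
    return segments


rows = [
    (1596.22, 0.099, "ett", "cor"),
    (1596.38, 0.199, "lopp", "cor"),
    (1596.579, 0.0, "<eps>", "del"),
    (1596.72, 0.16, "över", "cor"),
    (1597.0, 0.119, "tre", "cor"),
    (1597.32, 0.279, "dagar", "ins"),
]
segments = clean_segments(rows)
for start, end, text in segments:
    print(f"{start:.3f} {end:.3f} {text}")
```

Any mismatch closes the current segment, so each segment contains only words on which the sources agree.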

Results

Future work
More recent past work
Extracting sentences
(Quasi-)Phoneme-based recognition

Ongoing work
More complex phrases

Some things were never actually said
2019-04-09
2442203180006309721 1 230.46 0.06 Jag 1.0 Jag cor
2442203180006309721 1 230.6 0.08 har 1.0 har cor
2442203180006309721 1 230.76 0.2 flera 1.0 flera cor
2442203180006309721 1 231.04 0.3 kollegor 1.0 kollegor ins
2442203180006309721 1 231.4 0.159 här 1.0 <eps> sub
2442203180006309721 1 231.76 0.02 i 1.0 i cor
2442203180006309721 1 232.08 0.399 kammaren 1.0 kammaren cor
2442203180006309721 1 232.479 0.0 <eps> 1.0 som del
2442203180006309721 1 232.479 0.0 <eps> 1.0 inte del
2442203180006309721 1 232.479 0.0 <eps> 1.0 kommer del
2442203180006309721 1 232.479 0.0 <eps> 1.0 från del
2442203180006309721 1 232.479 0.0 <eps> 1.0 Stockholm. del

Phrases move
2442203180006309721 1 1596.22 0.099 ett 1.0 ett cor
2442203180006309721 1 1596.38 0.199 lopp 1.0 lopp cor
2442203180006309721 1 1596.579 0.0 <eps> 1.0 på del
2442203180006309721 1 1596.579 0.0 <eps> 1.0 60 del
2442203180006309721 1 1596.579 0.0 <eps> 1.0 mil del
2442203180006309721 1 1596.72 0.16 över 1.0 över cor
2442203180006309721 1 1597.0 0.119 tre 1.0 tre cor
2442203180006309721 1 1597.32 0.279 dagar 1.0 <eps> ins
2442203180006309721 1 1597.7 0.059 på 1.0 <eps> ins
2442203180006309721 1 1597.9 0.379 sextio 1.0 <eps> ins
2442203180006309721 1 1598.36 0.32 mil 1.0 dagar. sub

Things are added in the moment
2442203180006309721 1 2324.96 0.039 vi 1.0 vi cor
2442203180006309721 1 2325.1 0.32 måste 1.0 <eps> ins
2442203180006309721 1 2325.52 0.159 höra 1.0 <eps> ins
2442203180006309721 1 2325.76 0.319 talas 1.0 <eps> ins
2442203180006309721 1 2326.12 0.039 om 1.0 <eps> ins
2442203180006309721 1 2326.22 0.08 den 1.0 <eps> ins
2442203180006309721 1 2326.34 0.099 här 1.0 <eps> ins
2442203180006309721 1 2326.48 0.5 historien 1.0 <eps> ins
2442203180006309721 1 2327.14 0.34 gång 1.0 gång cor
2442203180006309721 1 2327.58 0.039 på 1.0 på cor

Thank you!
