36 Linguistics Across Disciplinary Borders
16 Some ASR-generated manual captions are recognizable by characteristic text-formatting
features: for example, one commercial service generates transcripts in
which the individual ‘words’ are two-character sequences. The visual effect of this
transcript formatting is that the individual words ‘unroll’ on the screen as the
corresponding video plays (e.g. aNeZBwGyFkk). A transcript of this kind, which
would have very little overlap with a transcript composed of standard lexical items,
would typically exhibit an ASR > 0.9. In this study, such transcripts have mostly
been filtered out by the processing steps described earlier.
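The two-character ‘unrolling’ signature lends itself to a simple token-length heuristic. The sketch below is hypothetical and is not the chapter’s actual filtering pipeline; it merely illustrates one way such transcripts could be flagged, by checking whether nearly all whitespace-delimited ‘words’ are two characters or shorter:

```python
def looks_like_unrolled_asr(transcript: str, threshold: float = 0.9) -> bool:
    """Illustrative heuristic (not the study's actual filter): flag a
    transcript whose 'words' are overwhelmingly short two-character
    fragments, the formatting signature described above."""
    tokens = transcript.split()
    if not tokens:
        return False
    short = sum(1 for t in tokens if len(t) <= 2)
    # Treat the transcript as 'unrolled' ASR output when the share of
    # very short tokens exceeds the chosen threshold.
    return short / len(tokens) >= threshold
```

For example, a fragment such as `"th is is an ex am pl e"` would be flagged, while an ordinary English sentence would not, since most of its tokens are full-length lexical items.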