Linguistics across Disciplinary Borders

Language, Data Science and Digital Humanities
Series editors: Mikko Laitinen (University of Eastern Finland, Finland) and
Jukka Tyrkkö (Linnaeus University, Sweden)
The growing availability of computer-readable language data, increasing
computational power and rapidly evolving statistical methodologies have had
a profound effect on how scholars study and analyse human language use.
However, the fields of linguistics, computer science and digital humanities have
largely developed their own separate approaches and paradigms, often failing
to communicate across disciplines in an effective way.
Language, Data Science and Digital Humanities bridges these disciplinary
gaps by publishing monographs and edited volumes that explore disciplinary
synergies and introduce new theoretical principles. Written in clear and
transparent language, these books offer cutting-edge digital methodologies
and create new opportunities for understanding how problems and research
questions can be approached from different perspectives.
The methodological range of the series covers empirical linguistics, natural
language processing, machine learning, data visualization, text mining,
mark-up and annotation, statistical tools in analysing language data, and
multimodal analysis. The volumes explain methodological solutions in detail
using worked examples, and are supported by companion websites, allowing
authors to share primary data, scripts, sophisticated data visualizations and
other digital content.
Editorial Board
Jennifer Edmond, Associate Professor of Digital Humanities (Trinity College
Dublin, Ireland)
Jacob Eisenstein, Assistant Professor of Computational Linguistics
(Georgia Institute of Technology, USA)
Bruno Gonçalves, Vice President of Data Science and Finance/Fellow
(JP Morgan Chase & Co/ISI Foundation, Italy)
Jack Grieve, Professor of Corpus Linguistics (University of Birmingham, UK)
Martin Hilpert, Assistant Professor of English Linguistics (Université de
Neuchâtel, Switzerland)
Andreas Kerren, Professor of Computer Science (Linnaeus University, Sweden)
Haidee Kotze, Professor of Translation Studies (KU Leuven, Belgium)
Krister Linden, Adjunct Professor of Language Technology (University of
Helsinki, Finland)

Dong Nguyen, Visiting Researcher in Text Mining Methods (Alan Turing
Institute, UK)
Kari-Jouko Räihä, Professor of Computer Science (Tampere University,
Finland)
Ute Römer, Assistant Professor of Applied Linguistics and English as a
Second Language (Georgia State University, USA)
David Shepard, Professor of Germanic Languages, Comparative Literature,
and Digital Humanities (University of California Los Angeles, USA)
Benedikt Szmrecsanyi, Associate Professor of Linguistics
(KU Leuven, Belgium)
Upcoming title in the series
Text Analytics for Corpus Linguistics and Digital Humanities, Gerold Schneider

Linguistics across Disciplinary
Borders
The March of Data
Edited by
Steven Coats and Veronika Laippala

BLOOMSBURY ACADEMIC
Bloomsbury Publishing Plc
50 Bedford Square, London, WC1B 3DP, UK
1385 Broadway, New York, NY 10018, USA
29 Earlsfort Terrace, Dublin 2, Ireland
BLOOMSBURY, BLOOMSBURY ACADEMIC and the Diana logo are trademarks of
Bloomsbury Publishing Plc
First published in Great Britain 2024
Copyright © Steven Coats, Veronika Laippala and Contributors, 2024
Steven Coats, Veronika Laippala and Contributors have asserted their right under the
Copyright, Designs and Patents Act, 1988, to be identified as Authors of this work.
Cover design: Elena Durey
Cover image © White Space Illustrations/Shutterstock
All rights reserved. No part of this publication may be reproduced or transmitted in any
form or by any means, electronic or mechanical, including photocopying, recording,
or any information storage or retrieval system, without prior permission in writing
from the publishers.
Bloomsbury Publishing Plc does not have any control over, or responsibility for, any third-
party websites referred to or in this book. All internet addresses given in this book were
correct at the time of going to press. The author and publisher regret any inconvenience
caused if addresses have changed or sites have ceased to exist, but can accept no
responsibility for any such changes.
A catalogue record for this book is available from the British Library.
A catalog record for this book is available from the Library of Congress.
ISBN: HB: 978-1-3503-6226-0
ePDF: 978-1-3503-6227-7
eBook: 978-1-3503-6228-4
Series: Language, Data Science and Digital Humanities
Typeset by Deanta Global Publishing Services, Chennai, India
To find out more about our authors and books visit www.bloomsbury.com and sign up for
our newsletters.
Online resources to accompany this book are available at https://www.bloomsburyonlineresources.com/language-data-science-and-digital-humanities. If you
experience any problems, please contact Bloomsbury at: onlineresources@bloomsbury.com

Contents

List of Figures ix
List of Tables xi
Introduction Steven Coats and Veronika Laippala 1

Part I Methods for Data Collection, Analysis and Visualization
1 Noisy Data: Using Automatic Speech Recognition Transcripts for Linguistic Research Steven Coats 17
2 Low-code Data Science Tools for Linguistics: Swiss Army Knives or Pretty Black Boxes? Jukka Tyrkkö and Daniel Ihrmark 40
3 The Visualization and Evaluation of Semantic and Conceptual Maps Gerold Schneider 67

Part II Corpus Construction, Registers and Genres
4 Towards Automatic Register Classification in Unrestricted Databases of Historical English Liina Repo, Brett Hashimoto, Aatu Liimatta, Lassi Saario, Tanja Säily, Iiro Tiihonen, Mikko Tolonen and Veronika Laippala 97
5 The Topical Landscape of Web Registers: Exploring the Interplay of Registers and Topicality in a Web-scale Corpus Valtteri Skantsi, Veronika Laippala and Aki-Juhani Kyröläinen 127
6 Towards ‘Large and Tidy’: Establishing Internal Structure in Mega-corpora Axel Bohmann 157

Part III Social Media, Discourse and Meanings
7 Multi-modal Considerations for Social Media Discourse Analysis: A Specialized Corpus of Twitter Commentary on ‘Working from Home’ Christopher Fitzgerald, Geraldine Mark, Anne O’Keeffe, Dawn Knight, Justin McNamara, Svenja Adolphs, Benjamin Cowan, Tania Fahey Palma, Fiona Farr and Sandrine Peraldi 187
8 Exploring Self-identification and the Functions of the Identify as Construction in the LGBTQ+ Reddit Corpus Laura Hekanaho, Turo Hiltunen, Minna Palander-Collin and Helmiina Hotti 213

List of Contributors 243
Index 249

Figures

1.1 Theoretically equivalent transcripts with high WER 24
1.2 Distribution of WER values for videos from four corpora 27
1.3 Confusion matrices for classifiers trained with manual transcripts and ASR transcripts 28
1.4 Most important features in the models trained with manual transcripts and ASR transcripts 29
1.5 Word types with highest G scores for manual transcripts and ASR transcripts 31
2.1 First four nodes of a sample workflow 47
2.2 Popup window showing options for the ‘Pivoting’ node 49
2.3 Concordancer workflow with visualization and hypothesis test 50
2.4 Visualization of the frequency distribution by gender using KNIME’s Violin Plot node 51
2.5 Topic modelling workflow 55
2.6 Heatmap showing twelve files and ten topics 55
2.7 Heatmap showing twelve files and ten topics, stop words included 56
3.1 Performance and confusion matrix of document classification 75
3.2 Top features for the 1940s and 1990s 75
3.3 Perplexity by number of topics 77
3.4 Two-dimensional projection with t-SNE of the semantic space of COHA news from the 1940s, with 500 words 81
3.5 Excerpt of a t-SNE map from COHA 1940s, with 5,000 words and word2vec embeddings 82
3.6 COHA news section, excerpt of conceptual map 83
3.7 Conceptual map of the 1940s from COHA news in overview 84
3.8 Excerpt of conceptual map of the 1940s from COHA news focusing on the Second World War 85
3.9 Annotation of the COHA 1940s conceptual map by a second annotator 86
3.10 Map excerpts demonstrating evaluation criterion 1: Internal coherence 88
3.11 Map excerpt demonstrating evaluation criterion 2: External coherence 89
3.12 Map demonstrating evaluation criterion 3: Global coherence 89
4.1 Proportion of hand-coded registers in the subsamples of GKL and COFEA 106
4.2 Heatmap of a confusion matrix presenting classification results with ECCO-BERT 111
5.1 The relationship between semantic coherence and number of topics 136
5.2 The keywords of the most frequent topics in the topic data of Parsebank 138
5.3 Frequency distribution of the topics in the Parsebank topic data 140
5.4 Typical topic associated with lyrical 141
5.5 Typical topic associated with spoken 142
5.6 Typical topics associated with informational persuasion 143
5.7 Typical topics associated with narrative 144
5.8 Document topic distribution associated with Example 7 145
5.9 Topical landscape of the Web registers 148
5.10 Bootstrap estimated stability of the topical landscape of the Web registers 149
6.1 Score distributions along dimension 1 by ICE text category 169
6.2 Score distributions along dimension 2 by ICE text category 172
6.3 Score distributions along dimension 3 by ICE text category 173
6.4 Score distributions along dimension 4 by ICE text category 175
6.5 Score distributions along all four dimensions for the blog and general texts in GloWbE 176
6.6 Coefficient estimates for two models predicting choice of FTR device based on 4,299,620 tokens from GloWbE 178
7.1 Broad thematic categories of tweets 199
7.2 Sub-themes of comment on WFH tweets 199
7.3 Sub-themes of social comment tweets 200

Tables

1.1 Number of ASR and manual transcripts by corpus 23
1.2 Mean and median WERs by corpus 26
1.3 Summary of classification results 28
3.1 Mallet topic model output 78
3.2 Labels given to the topics by two annotators 79
3.3 Lexical items closest to four selected terms in the ‘news’ subsection of COHA 80
3.4 Lexical items closest to 1940 in the ‘news’ subsection of COHA 81
3.5 Overlap in annotation by two annotators 87
4.1 Classification results with micro and macro-averaged F1-values 108
4.2 F1-scores for identified registers in GKL and COFEA 109
4.3 F1-scores for unidentified registers in GKL and COFEA 110
5.1 The hierarchical register taxonomy included in FinCORE mirroring the taxonomy of CORE 130
5.2 Register-specific discriminability on FinCORE test data 131
5.3 Register distribution in the FinCORE training set and in Parsebank based on the predicted register classes 132
5.4 Distribution of registers in the topic data 135
5.5 Topic type frequencies in the Parsebank topic data 139
5.6 Topic diversity of the Web registers in the topic data of Parsebank 146
6.1 The sampling frame of the International Corpus of English 165
6.2 Salient structure coefficients for Dimension 1 168
6.3 Salient structure coefficients for Dimension 2 170
6.4 Salient structure coefficients for Dimension 3 172
6.5 Salient structure coefficients for Dimension 4 174
7.1 Numbers and dates of tweets extracted to form the WFHTC 192
7.2 Breakdown of content in tweets with extratextual media in the WFHTC 197
7.3 Themes, sub-themes and example tweets of WFHTC 198
7.4 Sentiment categorization of tweets 202
8.1 Multiword sketch for identify as in enTenTen20 214
8.2 Statistics of the LGBTQ+ Reddit corpus 218
8.3 Keyword analysis of the LGBTQ+ Reddit corpus 219
8.4 Subject tokens, spelling normalized 222
8.5 Subject reference 223
8.6 Complement labels, spelling normalized 224
8.7 Label category 225
8.8 Textual functions 227
8.9 Subreddits included in the LGBTQ+ Reddit corpus 240
8.10 Subreddits included in the analysed sample 241

Introduction
Steven Coats and Veronika Laippala
Increasing digitization has made linguistic data more accessible than ever
before. In particular, the internet, and the many communication and social media
platforms it hosts, has become an important source of data for researchers
working in various disciplines that utilize linguistic material to study how
information is transmitted and spread around the world: how people interact,
communicate and understand aspects of the world around them and how
languages and language varieties themselves may be developing and changing
due to new communication habits and technological affordances. Similarly, the
digitization of historical resources has enabled new methods of analysis for our
understanding of the past. Instead of focusing on a handful of examples that
can be qualitatively examined, researchers can use new, digitized databases of
historical sources for quantitative examination of large-scale tendencies and
changes occurring over the course of centuries.
However, the vast quantities of available digital data call for the application of
novel methods and pose new analytical challenges. The ‘march of data’ denotes
this border region where linguistics, humanities, social science and
information technology overlap. It also refers to the ongoing development
of underlying technologies, the generation of new data sources, and the
application of new methodological approaches: all of these are factors that drive
analysis in these fields. Some techniques are necessary in order to guarantee
the reliability and validity of analyses conducted upon large-scale datasets.
Concordancing, keyword analysis, and statistical methods, for example, have
been applied in corpus linguistics for decades to ensure the reliable and efficient
analysis of corpora (see Biber 1988; Dunning 1993; Scott 1997). New methods
and technologies are constantly developed, allowing researchers to push the
envelope of what is possible and take the steps necessary to tame corpora that
are larger, more diverse and sometimes also ‘noisier’, in the sense of containing
errors or other unwanted content, than their predecessors. New techniques for
visualization and topic discovery and evolving machine-learning methods, for
example, provide researchers with novel tools to examine language data in ways
we did not know existed a decade or two ago.
In addition to novel technologies, the march of data entails new kinds of
settings for language use. Social media platforms, such as Twitter, YouTube or
Reddit, typically offer similar communicative affordances to users, but differ
in terms of focus and user base. Similarly, conversations, channels, videos or
user-created spaces on a single social media platform can exhibit a wide range
of discursive practices. For the researcher, the diversity of computer-mediated
communication (CMC) formats, modalities and conventions means that
research hypotheses, methodological frameworks and analytical approaches
must be carefully considered so that content can be analysed and interpreted
appropriately. In the formulation of Maslow (1966), ‘if all you have is a hammer,
everything is a nail’, and for researchers working with language data, a similar
consideration prevails: analyses and interpretations of online content, from
social media or other modalities, need to be conducted with an eye to the
characteristics of the underlying discourse and communicative situation, and
not with a single analytical method as a hammer. Without considering, for
example, register (genre) and user community differences inherent to different
social media content, we risk misunderstanding what is going on in terms of
language use and drawing incorrect conclusions. This book highlights a number
of approaches for the analysis and interpretation of language data. It focuses
on three principal aspects of the march of data: the use of novel methods for
data collection, processing, and analysis; the development of new methods for
automatic annotation of corpora in order to make them more useful; and the
understanding of specific discourses and identity-making practices on social
media platforms. In the following, before presenting the chapters, we discuss
central concepts related to these three trends, as well as the advantages they offer
and the challenges they pose for the analyst.
Methods for data collection, analysis and visualization
In corpus linguistics, various statistical methods have been applied for decades
in order to obtain quantitative evidence for particular language phenomena.
For instance, multidimensional analysis, introduced by Doug Biber (Biber
1988; Berber Sardinha and Veirano Pinto 2019), applies factor analysis to
identify co-occurring linguistic patterns, or dimensions, in the data, and
then interprets these dimensions functionally to describe registers, that is,
situationally defined text categories such as news, lyrical texts or fiction (Biber
1988). Similarly, keyword analysis identifies important words in a corpus in
comparison to a reference corpus by using measures such as log-likelihood or
chi-squared test statistic values (Scott 1997; Stubbs and Tribble 2006; Egbert
and Biber 2019).
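To illustrate the keyword statistic, the following minimal Python sketch computes Dunning’s (1993) log-likelihood (G²) for a single word from a 2×2 contingency table of word frequency and corpus size, in the two-cell form commonly applied in corpus-linguistic keyword analysis; the counts are invented for illustration and do not come from any chapter in this volume.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Dunning's (1993) log-likelihood (G2) for a word's frequency
    in a target corpus versus a reference corpus."""
    # Expected frequencies under the null hypothesis of equal rates
    total = size_target + size_ref
    expected_target = size_target * (freq_target + freq_ref) / total
    expected_ref = size_ref * (freq_target + freq_ref) / total
    g2 = 0.0
    for observed, expected in ((freq_target, expected_target),
                               (freq_ref, expected_ref)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented counts: a word occurring 120 times in a 1M-word target
# corpus and 30 times in a 2M-word reference corpus.
print(round(log_likelihood(120, 1_000_000, 30, 2_000_000), 2))
```

The higher the G² value, the more the word’s frequency in the target corpus departs from what the reference corpus would lead one to expect, making it a candidate keyword.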
Furthermore, advances in machine learning and natural language processing
(NLP) have provided novel methods that benefit linguistics and other
disciplines using linguistic data. Specifically, methods based on supervised and
unsupervised machine learning are widely used in language processing, as are
word embeddings, that is, distributed word representations learned from large
amounts of linguistic data.
Typical applications of supervised machine learning include text classification,
such as classifying emails as spam or not spam, or sentiment analysis, in which
texts are classified as neutral, positive or negative. The basis of supervised
machine learning lies in a set of input observations, such as texts, that are linked
to outputs, such as sentiment labels. An algorithm then learns to map the input
texts to their associated output labels. To do this, it builds a model on the basis
of the input data, specifically on a set of training texts for which the correct labels
have been manually annotated. This model can then be used to predict the labels
for new texts (Jurafsky and Martin 2021, chapter 4).
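A minimal sketch of this workflow in Python with the scikit-learn library; the four training texts and their sentiment labels are invented for illustration, and a real classifier would be trained on far more annotated data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: texts paired with manually assigned sentiment labels
train_texts = ["a wonderful, thoughtful study",
               "dull and poorly argued",
               "clear, engaging and convincing",
               "confusing and tedious throughout"]
train_labels = ["positive", "negative", "positive", "negative"]

# Learn a mapping from bag-of-words features to the output labels
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# The fitted model can now predict labels for unseen texts
print(model.predict(["an engaging and convincing argument"]))
```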
Different kinds of text classification applications can be useful for the further
analysis of linguistic data – for instance, as we will explain in the following section,
register identification can provide more information on the texts included in
large and noisy datasets. Another application based on supervised machine
learning that is frequently used in text analysis is syntactic parsing, in which text
is segmented into sentences and words, and analysed on the basis of syntactic
and morphological structure, for example, by reducing words to their base forms
or lemmas (Jurafsky and Martin 2021, chapters 13–14; de Marneffe et al. 2021).
Similarly, automatic speech recognition (ASR) is based on training data – input-
output pairs of recorded speech and text in which an audio segment has been
labelled with a word or a phoneme (Jurafsky and Martin 2021, chapter 21). In
this book, ASR is discussed in the chapter by Coats.
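For the syntactic parsing and lemmatization mentioned above, a minimal sketch using the spaCy library (one possible parser among several, not necessarily the one used in the chapters; the example sentence is invented):

```python
import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The corpora were parsed and the words reduced to lemmas.")

for token in doc:
    # Surface form, lemma (base form), part of speech, syntactic relation
    print(token.text, token.lemma_, token.pos_, token.dep_)
```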
Whereas supervised machine learning is based on learning the mapping
between inputs and their correct outputs from training data, unsupervised
machine learning learns from unlabelled data by identifying similar patterns
or instances and then grouping them together. In text mining, a frequently
applied method based on unsupervised machine learning is topic modelling
(Blei et al. 2003). This technique can identify hidden topics in collections
of texts based on co-occurring patterns of words. In this approach, topics
are defined as distributions of particular words, and texts as distributions of
particular topics. Topics can then be analysed, for example, on the basis of their
most probable words, whose semantics likely reflect the topic’s underlying
thematic content in a plausible manner. Additionally, as each text is associated
with probabilities for each topic, the method allows the analysis to focus on
texts with high scores on specific topics, facilitating the processing and filtering
of large-scale datasets.
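As an illustration, the following Python sketch fits a small topic model using scikit-learn’s implementation of Latent Dirichlet Allocation (Blei et al. 2003); the four toy documents are invented, and a real analysis would of course require far larger collections.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Four invented toy documents on two broad themes
docs = ["the election results and the new government policy",
        "parliament debated the government budget and policy",
        "the team won the match in the final minutes",
        "players and fans celebrated after the championship match"]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Topics are distributions over words; documents are distributions over topics
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

vocab = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [vocab[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}:", top_words)

# Per-document topic probabilities, usable for filtering large datasets
print(lda.transform(counts).round(2))
```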
Topic modelling is widely applied in many disciplines working on large
digital linguistic data, such as social sciences, communication studies and
computational psychology (see Roberts, Stewart and Airoldi 2016; Boyd and
Schwartz 2021; Boumans and Trilling 2016 for overviews). In linguistics,
however, the method has received criticism as well (Brookes and McEnery
2019). In this book, the method is discussed in several chapters that shed light
on its benefits and possible shortcomings.
Like many pre-neural machine-learning methods, topic modelling is based on
feature frequencies calculated from textual data. Whereas these representations
can capture many aspects of meaning and achieve good performance on many
machine-learning tasks, they are still limited, because texts are represented as
collections of frequencies of word types or other features, an atomistic approach
often referred to as ‘bag-of-words’ language modelling. Relatively recent
developments in computer science and NLP, however, in particular the use
of neural networks, have led to more sophisticated methods being developed.
Specifically, text can be represented using embeddings based on distributed
representations, that is, vectors of numerical values created from the usage
contexts of words or other linguistic elements in large amounts of linguistic data.
The underlying idea behind word embeddings is the distributional hypothesis.
Similar words, in terms of semantics as well as grammatical functions, tend to
occur in similar contexts (Harris 1954). In Firth’s famous dictum, ‘you shall
know a word by the company it keeps’ ([1957]1968, 178). However, while
raw co-occurrence frequencies can be used for NLP tasks, they are unwieldy
to manipulate. The well-known word2vec algorithm (Mikolov et al. 2013)
creates dense vector representations of words using machine learning,
with advantages in terms of accuracy and computational cost.
Specifically, using very large datasets of raw text, the algorithm generates the
word embeddings based on the aggregate usage contexts of the target word,
resulting in semantically similar words having vectors that are closer to each
other in multidimensional vector space. Importantly, as the embeddings are
learned from free text, no manual work is needed to annotate the training data
(see Jurafsky and Martin 2021, chapter 6).
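The interface to such a model can be sketched with the gensim library; the toy sentences below are invented, and a usable model would be trained on millions of sentences.

```python
from gensim.models import Word2Vec

# A real model would be trained on a very large corpus; this toy
# corpus only illustrates the interface.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"],
             ["a", "cat", "and", "a", "dog", "played"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

# Each word now has a dense vector; words used in similar contexts
# receive vectors that are close in the embedding space.
print(model.wv["cat"].shape)            # (50,)
print(model.wv.similarity("cat", "dog"))
```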
Word embeddings are brought to the next level by more recent and much
more complex neural architectures, such as BERT (Bidirectional Encoder
Representations from Transformers; Devlin et al. 2018). In a static embedding
model such as word2vec, each word receives one embedding. In BERT-
style architectures, embeddings vary according to the context in which the
target word is used. This better enables the language model to distinguish the
meanings associated with homonyms (such as ‘lead’, the soft metal, and ‘lead’, a
verb meaning to go in front or to govern). Similar to static embeddings, they are
typically trained using massive amounts of Web data.
Downstream NLP tasks such as text classification can benefit from contextual
embeddings by fine-tuning a language model with manually annotated training
data. This allows the system to create specific contextual data representations
and typically results in better performance (see Jurafsky and Martin 2021,
chapter 11). In this book, this is shown in chapters by Repo et al. and Skantsi
et al., in which BERT and other, similar architectures have been fine-tuned for
various purposes.
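The effect of contextualization can be illustrated directly. The sketch below assumes the pre-trained bert-base-uncased checkpoint available through the Hugging Face transformers library (a generic example, not a model used in any chapter of this volume); it extracts the vector for ‘lead’ in three invented sentences, where the two metal senses should lie closer to each other than to the verb.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(sentence, word):
    """Return the contextual vector for `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]
    idx = enc.input_ids[0].tolist().index(
        tokenizer.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = embedding_of("the pipe was made of lead", "lead")
v2 = embedding_of("she will lead the team", "lead")
v3 = embedding_of("the roof was covered in lead", "lead")

# The two metal senses should be more similar to each other than to the verb
cos = torch.nn.functional.cosine_similarity
print(cos(v1, v3, dim=0), cos(v1, v2, dim=0))
```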
Corpus construction, registers and genres
Ready-made, carefully compiled corpora have been a central part of corpus
linguistics since the 1960s, the Brown Corpus (Francis and Kucera 1967) being
the first to present researchers a balanced, representative and well-documented
collection of linguistic data in digital format. Since then, numerous new corpora
have been compiled using the same procedures as were used to create the Brown
Corpus; in addition, new kinds of corpora have been created in order to provide
researchers the opportunity to examine language from various perspectives,
including national or regional variety or historical era (for English, see, for
example, the resources at https://english -corpora .org).
However, digitization has created vast amounts of linguistic data in digital
formats that differ dramatically from these carefully planned collections. The
internet provides access to language data from a wide variety of usage contexts,
providing new opportunities both for linguistically oriented web-as-corpus
research (Kilgarriff and Grefenstette 2003; Kilgarriff 2007) as well as for technical
innovations, where Web data are used for the development of language models
and thus more efficient and higher quality natural-language processing systems
(Devlin et al. 2018; Conneau et al. 2020). The sheer size of Web-derived datasets
already distinguishes them from the corpora typically employed in corpus
linguistics: the OSCAR dataset developed for natural language processing, for
example, comprises 6.3 terabytes of data in more than 100 languages (Ortiz
Suárez et al. 2019). Processing datasets of this magnitude requires computational
resources and dedicated technologies, as well as new kinds of analytical
approaches. For comparing word frequencies across text corpora, for example,
dispersion or effect-size measures have been suggested as more useful than
standard statistical hypothesis tests such as the chi-squared test, which for very
large corpora will almost always yield p-values indicating a significant difference
(Gries 2005).
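A small Python illustration of this point, with invented frequencies: in two corpora of a billion words each, rates of 10.5 versus 10.0 occurrences per million words differ negligibly in practical terms, yet a chi-squared test declares the difference highly significant.

```python
from scipy.stats import chi2_contingency

N = 1_000_000_000  # two billion-word corpora
# 10.5 vs 10.0 occurrences per million words: a tiny absolute difference
table = [[10_500, N - 10_500],
         [10_000, N - 10_000]]
chi2, p, _, _ = chi2_contingency(table)
print(f"p = {p:.2e}")  # well below 0.05 despite the negligible difference
```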
In addition to size, another major challenge posed by Web-derived datasets
is their noisiness and lack of metadata. Even though automatically crawled
datasets are typically cleaned of ‘boilerplate’ (i.e. template content on a web
page such as headings or indexing elements) with tools such as Trafilatura
(Barbaresi 2021), they can still include fragments of code, HTML artifacts or
language content that are not part of the main text of the page (see the evaluation
in Laippala et al. 2022). Furthermore, and crucially, text that has been collected
automatically from websites may not include quantifiable information as to its
origin. In a dataset created from web crawling, all documents may have an equal
status independent of their genre or register (see Biber 1988), regardless of whether
they originate from user manuals, legislation, news reports, blogs or other sources.
This fact can present challenges for the ensuing analysis: because language use
tends to vary according to situational context and register, information about
a web-harvested document’s source and communicative functions is crucial for
understanding its message (see Biber 2012; Biber and Conrad 2009).
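For the boilerplate-removal step mentioned above, a minimal sketch using the Trafilatura library (Barbaresi 2021); the URL below is a placeholder, not a page analysed in this volume.

```python
import trafilatura

# Fetch a page and strip boilerplate (menus, headers, footers),
# keeping only the main text of the document.
downloaded = trafilatura.fetch_url("https://example.com/article")
if downloaded is not None:
    main_text = trafilatura.extract(downloaded)
    print(main_text)
```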
Some studies in Web genre or Web register identification have aimed to create
Web corpora in which genre or register categories are annotated (e.g. ‘news’,
‘blog’, or ‘company website’). However, the task has been unexpectedly difficult
because of the range of linguistic variation found online. The extent to which web-
based text content comprises specific genres or registers is not known, and even
if it were, these categories may not always be discrete or well-defined, making
their identification difficult (Santini et al. 2021; Sharoff et al. 2010; Biber et al.
2020). The most recent studies have, however, shown promising results: several
large Web register corpora with manual annotations, representing the whole
range of register categories found on the Web, have been published together
with encouraging results on their automatic identification (e.g. Repo et al. 2021;
Laippala et al. 2022). In this book, Web registers and genres are discussed in the
chapters by Bohmann and by Skantsi et al.

‘Born digital’ web content is a significant data source for researchers
working with language material, but there are others as well. Digitized, OCR-
scanned datasets of historical printed content represent an important resource
for linguists and others working with digital language data. Collections such
as Early English Books Online (EEBO) or Eighteenth-Century Collections
Online (ECCO), for example, comprise hundreds of millions of words. ECCO is
claimed to include ‘every significant English-language and foreign-language title
printed in Britain, Ireland, overseas territories under British colonial rule, and
the United States between the years 1701 and 1800’.¹ Similar to Web data, these
historical collections are not only large and varied, but also noisy and sometimes
lacking quantifiable metadata. Instead of boilerplate and other kinds of noise
originating from the crawling process, these datasets include OCR errors that
can create problems for computational analyses (Rastas et al. 2022). As with
Web data, the genre or register categories of digitized historical documents may
not be known, an issue that may pose challenges for certain types of linguistic
analysis. These challenges are explored in the chapter by Repo et al.
Social media, discourse and meanings
Early research into the linguistic properties of computer-mediated
communication (CMC) considered modalities such as email, online chats or
message boards, proposing they would be ‘conceptually oral’ in terms of their
relative use of grammatical features (see, e.g., Herring 1996; Baron 1998; Crystal
2006); these online types of writing are also likely to exhibit features such as non-
standard abbreviations and orthography or use of emoticons (and later emoji).
With the rise of the ‘Web 2.0’, loosely defined as the shift in the early 2000s
towards more interactive kinds of web content, an increasing proportion of
online communication began to take place on commercial social media platforms
such as Facebook, Twitter, YouTube or Reddit. These sites, although they tend to
utilize the same underlying technical protocols for data transmission, can vary
greatly in terms of characteristic contents and user communities, and even on a
single platform, a wide range of topical concerns can be addressed by users who
may have different social and demographic characteristics and exhibit different
writing styles.
Twitter/X, especially, has been a popular source of data for analyses of language,
including studies of language diversity and multilingualism (e.g. Mocanu et al.
2013; Coats 2019a, 2019b), the use of dialects (e.g. Grieve et al. 2019; Purschke
and Hovy 2019) or the pragmatics of language use on the platform itself (see,
e.g., Zappavigna 2012), as well as for studies dealing with politics, migration,
catastrophes and a wide range of other topics (see, e.g., Tumasjan et al. 2010;
Hübl et al. 2017; Murzintcev and Cheng 2017). Twitter’s popularity was in
part due to the accessibility of its API (Application Programming Interface),
which allowed researchers to collect relatively large amounts of data with simple
scripts. Reddit has grown to become one of the world’s most popular platforms
for discussion and content sharing since its establishment in 2005. The site’s
millions of users have made billions of posts, many of which have been made
accessible in corpus form (Baumgartner et al. 2020), and a wide variety of topics
have been investigated using data from Reddit (for an overview see Proferes
et al. 2021).
Most studies of social media content have been based on textual data – the
analysis of multimedia social media content, containing images and video,
typically requires larger storage capacities, more memory and more processing
power, in addition to a command of data-analytical approaches that may not
be second nature for a linguist. In part due to the newness of multimedia
social media content, at least relative to textual content, many studies of social
media video and streaming content have therefore focused on issues such as
the description and categorization of interactive possibilities, which for a
platform like YouTube can be complex (Dynel 2014), or issues of identity raised
in particular content (e.g. Androutsopoulos 2013). Use of automatic speech
recognition (ASR) transcripts and corpora created from them may open new
possibilities and serve as a starting point for acoustic and multimodal analysis
(see Coats 2022, Chapter 2). In the third part of this book, data from Reddit and
Twitter are utilized to shed light on social and cultural issues; the analysis of
Twitter data also considers multimodal content.
Presentation of the chapters
The chapters in this book are organized into three sections corresponding
to these three aspects of the march of data. In the first section, focusing on
methods for data collection, analysis and visualization, Steven Coats explores
the new possibilities opened up by advances in ASR technology. Comparing
ASR and manual transcripts from YouTube videos, he shows how ASR, despite
noisiness in the form of transcript errors, can be useful for linguistic analysis.
In the second chapter, Jukka Tyrkkö and Daniel Ihrmark discuss low-code
programming environments as an accessible gateway into data science for
researchers with limited or no background in programming. They introduce
an open-source modular toolkit, KNIME, critically discuss its usefulness,
benefits and shortcomings with two example use cases, and reflect on the risks
and possibilities offered by the tool to researchers. In the third chapter in this
section, Gerold Schneider utilizes data from a historical corpus of English to
compare several different data analysis approaches popular in corpus linguistics
and digital humanities, showing how they can shed light on different aspects of
concepts as they develop over time.
Liina Repo et al. open the second section, ‘Corpus construction, registers,
and genres’, with a study on modelling registers in two large historical corpora
from the eighteenth century: the Corpus of Founding Era American English
and the Goldsmiths’-Kress Library of economic literature. With the aim of
understanding historical register variation and working towards automatic
register identification in noisy historical data, they compare several BERT-
based models to tackle the OCR noise and analyse their outputs across the
two corpora. Valtteri Skantsi et al. continue the register analysis of noisy and
very large corpora by focusing on the Finnish Internet Parsebank, a Web-
crawled corpus of Finnish containing billions of words. Using automatic
register identification and topic modelling, they study the interplay of registers
and topics and how these two methods can provide more information about
the documents in big language datasets. Similarly, Axel Bohmann considers
how document metadata can be automatically annotated in large and noisy
corpora using statistical methods. He uses multidimensional analysis to
examine the large and less carefully documented Corpus of Global Web-
based English (GloWbE; Davies and Fuchs 2015), then analyses the resulting
dimensions of linguistic variation using the International Corpus of English
(ICE; Greenbaum and Nelson 1996), finally associating the dimensions
with GloWbE documents in order to gain insight into their contents and
characteristics.
The third section, social media, discourse, and meanings, starts with the
chapter by Christopher Fitzgerald et al., which examines a corpus of tweets
about working from home during the Covid-19 pandemic. They argue that big
data approaches are not always appropriate: a small and specialized corpus can
be more suitable for holistic analysis of multimodal content from social media
platforms. Finally, Laura Hekanaho et al. explore the expression of sexual and
gender identity in online discourse in a 44-million-word corpus of posts from
Reddit. Studying in what ways and to what extent linguistic constructions
of self-identification, namely ‘identify as X’, ‘be X’ and ‘as a X’, are employed
on the r/lgbt and r/nonbinary Subreddits, they shed light on the discursive
practices of self-identification for a gender/sexual minority group.
Note
1 https://www.gale.com/intl/primary-sources/eighteenth-century-collections-online
References
Androutsopoulos, J. (2013), ‘Participatory Culture and Metalinguistic Discourse:
Performing and Negotiating German Dialects on YouTube’, in D. Tannen and A.
M. Trester (eds), Discourse 2.0: Language and New Media, 47–72, Washington, DC:
Georgetown University Press.
Barbaresi, A. (2021), ‘Trafilatura: A Web Scraping Library and Command-Line Tool
for Text Discovery and Extraction’, in Proceedings of the 59th Annual Meeting
of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing: System Demonstrations, 122–31, Online:
Association for Computational Linguistics.
Baron, N. S. (1998), ‘Letters by Phone or Speech by Other Means: The Linguistics of
Email’, Language & Communication, 18 (2): 133–70.
Baumgartner, J., S. Zannettou, B. Keegan, M. Squire, and J. Blackburn (2020), ‘The
Pushshift Reddit Dataset’, in Proceedings of the International AAAI Conference on
Web and Social Media, 830–9. https://doi.org/10.1609/icwsm.v14i1.7347.
Berber Sardinha, T., and M. Veirano Pinto, eds. (2019), Multi-dimensional Analysis:
Research Methods and Current Issues, New York: Bloomsbury Publishing.
Biber, D. (1988), Variation Across Speech and Writing, Cambridge: Cambridge
University Press.
Biber, D. (2012), ‘Register as a Predictor of Linguistic Variation’, Corpus Linguistics and
Linguistic Theory, 8 (1): 9–37.
Biber, D., and S. Conrad (2009), Register, Genre, and Style, Cambridge: Cambridge
University Press.
Biber, D., J. Egbert, and D. Keller (2020), ‘Reconceptualizing Register in a Continuous
Situational Space’, Corpus Linguistics and Linguistic Theory, 16 (3): 581–616.
Blei, D. M., A. Y. Ng, and M. I. Jordan (2003), ‘Latent Dirichlet Allocation’, Journal of
Machine Learning Research, 3: 993–1022.
Boumans, J. W., and D. Trilling (2016), ‘Taking Stock of the Toolkit’, Digital Journalism,
4 (1): 8–23.

Boyd, R. L., and H. A. Schwartz (2021), ‘Natural Language Analysis and the
Psychology of Verbal Behavior: The Past, Present, and Future States of the Field’,
Journal of Language and Social Psychology, 40 (1): 21–41. https://doi.org/10.1177/0261927X20967028.
Brookes, G., and T. McEnery (2019), ‘The Utility of Topic Modelling for Discourse
Studies: A Critical Evaluation’, Discourse Studies, 21 (1): 3–21. https://doi.org/10.1177/1461445618814032.
Coats, S. (2019a), ‘Language Choice and Gender in a Nordic Social Media Corpus’,
Nordic Journal of Linguistics, 42 (1): 31–55.
Coats, S. (2019b), ‘Online Language Ecology: Twitter in Europe’, in E. Stemle and
C. Wigham (eds), Building Computer-mediated Communication Corpora for
Sociolinguistic Analysis, 73–96, Clermont-Ferrand: Presses universitaires Blaise Pascal.
Coats, S. (2022), ‘Naturalistic Double Modals in North America’, American Speech.
https://doi.org/10.1215/00031283-9766889.
Conneau, A., K. Kartikay, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave,
M. Ott, L. Zettlemoyer, and V. Stoyanov (2020), ‘Unsupervised Cross-lingual
Representation Learning at Scale’, in Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, 8440–51, Online: Association for
Computational Linguistics.
Crystal, D. (2006), Language and the Internet, 2nd edn, Cambridge: Cambridge
University Press.
Davies, M., and R. Fuchs (2015), ‘Expanding Horizons in the Study of World Englishes
with the 1.9 Billion Word Global Web-based English Corpus (GloWbE)’, English
World-Wide, 36 (1): 1–28. https://doi.org/10.1075/eww.36.1.01dav.
de Marneffe, M.-C., C. D. Manning, J. Nivre, and D. Zeman (2021), ‘Universal
Dependencies’, Computational Linguistics, 47 (2): 255–308. https://doi.org/10.1162/coli_a_00402.
Devlin, J., M. W. Chang, K. Lee, and K. Toutanova (2018), ‘BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding’, in Proceedings of the 2019
Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, 1 (Long and Short Papers): 4171–86.
Association for Computational Linguistics.
Dunning, T. (1993), ‘Accurate Methods for the Statistics of Surprise and Coincidence’,
Computational Linguistics, 19: 61–74.
Dynel, M. (2014), ‘Participation Framework Underlying YouTube Interaction’, Journal of
Pragmatics, 73: 37–52. https://doi.org/10.1016/j.pragma.2014.04.001.
Egbert, J., and D. Biber (2019), ‘Incorporating Text Dispersion into Keyword Analyses’,
Corpora, 14: 77–104. https://doi.org/10.3366/cor.2019.0162.
Firth, J. R. ([1957] 1968), ‘A Synopsis of Linguistic Theory 1930–1955’, in F. Palmer (ed),
Selected Papers of J. R. Firth, 168–205, Harlow: Longman.
Francis, W. N., and H. Kucera (1967), Computational Analysis of Present-day American
English, Providence: Brown University Press.

Greenbaum, S., and G. Nelson (1996), ‘The International Corpus of English (ICE)
Project’, World Englishes, 15 (1): 3–15.
Gries, S. T. (2005), ‘Null-hypothesis Significance Testing of Word Frequencies: A
Follow-up on Kilgarriff’, Corpus Linguistics and Linguistic Theory, 1 (2): 277–94.
Grieve, J., C. Montgomery, A. Nini, A. Murakami, and D. Guo (2019), ‘Mapping Lexical
Dialect Variation in British English using Twitter’, Frontiers in Artificial Intelligence,
Section Language and Computation. https://doi.org/10.3389/frai.2019.00011.
Harris, Z. S. (1954), ‘Distributional Structure’, Word, 10: 146–62.
Herring, S. C., ed. (1996), Computer-mediated Communication: Linguistic, Social and
Cross-Cultural Perspectives, Amsterdam: John Benjamins.
Hübl, F., S. Cvetojevic, H. Hochmair, and G. Paulus (2017), ‘Analyzing Refugee
Migration Patterns Using Geo-tagged Tweets’, International Journal of Geo-
Information, 6 (10). https://doi.org/10.3390/ijgi6100302.
Jurafsky, D., and J. H. Martin (2021), Speech and Language Processing, 3rd edn, draft.
Available online: https://web.stanford.edu/~jurafsky/slp3/.
Kilgarriff, A. (2007), ‘Last Words: Googleology Is Bad Science’, Computational
Linguistics, 33 (1): 147–51.
Kilgarriff, A., and G. Grefenstette (2003), ‘Introduction to the Special Issue on the Web
as Corpus’, Computational Linguistics, 29 (3): 333–47.
Laippala, V., S. Rönnqvist, M. Oinonen, A.-J. Kyröläinen, A. Salmela, D. Biber, J. Egbert,
and S. Pyysalo (2022), ‘Register Identification from the Unrestricted Open Web
Using the Corpus of Online Registers of English’, Language Resources and Evaluation.
https://doi.org/10.1007/s10579-022-09624-1.
Laippala, V., A. Salmela, S. Rönnqvist, A. Fikri Aji, L. Chang, A. Dhifallah, L. Goulart,
H. Kortelainen, M. Pàmies, D. Prina Dutra, V. Skantsi, L. Sutawika, and S. Pyysalo
(2022), ‘Towards Better Structured and Less Noisy Web Data: Oscar with Register
Annotations’, in Proceedings of the Eighth Workshop on Noisy User-generated Text
(W-NUT 2022), 215–21, Gyeongju: Association for Computational Linguistics.
Maslow, A. (1966). The Psychology of Science: A Reconnaissance, New York: Harper &
Row.
Mikolov, T., K. Chen, G. Corrado, and J. Dean (2013), ‘Efficient Estimation of Word
Representations in Vector Space’, arXiv:1301.3781 [cs.CL]. https://doi.org/10.48550/arXiv.1301.3781.
Mocanu, D., A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and A. Vespignani
(2013), ‘The Twitter of Babel: Mapping World Languages Through Microblogging
Platforms’, PLoS ONE, 8 (4). https://doi.org/10.1371/journal.pone.0061981.
Murzintcev, N., and C. Cheng (2017), ‘Disaster Hashtags in Social Media’,
International Journal of Geo-Information, 6 (7). https://doi.org/10.3390/ijgi6070204.
Ortiz Suárez, P. J., B. Sagot, and L. Romary (2019), ‘Asynchronous Pipelines for
Processing Huge Corpora on Medium to Low Resource Infrastructures’, in
Proceedings of the Workshop on Challenges in the Management of Large Corpora
(CMLC7), 9–16, Mannheim: Leibniz-Institut für Deutsche Sprache.

Proferes, N., N. Jones, S. Gilbert, C. Fiesler, and M. Zimmer (2021), ‘Studying Reddit: A
Systematic Overview of Disciplines, Approaches, Methods, and Ethics’, Social Media
+ Society, 7 (2). https://doi.org/10.1177/20563051211019004.
Purschke, C., and D. Hovy (2019), ‘Lörres, Möppes, and the Swiss: (Re)Discovering
Regional Patterns in Anonymous Social Media Data’, Journal of Linguistic Geography,
7 (2): 113–34. https://doi.org/10.1017/jlg.2019.10.
Rastas, I., Y. C. Ryan, I. Tiihonen, M. Qaraei, L. Repo, R. Babbar, E. Mäkelä, M.
Tolonen, and F. Ginter (2022), ‘Explainable Publication Year Prediction of
Eighteenth Century Texts with the BERT Model’, in Proceedings of the 3rd Workshop
on Computational Approaches to Historical Language Change, 68–77, Dublin:
Association for Computational Linguistics.
Repo, L., V. Skantsi, S. Rönnqvist, S. Hellström, M. Oinonen, A. Salmela, and V.
Laippala (2021), ‘Beyond the English Web: Zero-shot Cross-lingual and Lightweight
Monolingual Classification of Registers’, in Proceedings of the 16th Conference of the
European Chapter of the Association for Computational Linguistics: Student Research
Workshop, 183–91, Online: Association for Computational Linguistics.
Roberts, M. E., B. M. Stewart, and E. M. Airoldi (2016), ‘A Model of Text for
Experimentation in the Social Sciences’, Journal of the American Statistical
Association, 111 (515): 988–1003. https://doi.org/10.1080/01621459.2016.1141684.
Santini, M., A. Mehler, and S. Sharoff (2021), ‘Riding the Rough Waves of Genre on the
Web: Concepts and Research Questions’, in A. Mehler, S. Sharoff, and M. Santini
(eds), Genres on the Web: Computational Models and Empirical Studies, 3–30,
Dordrecht: Springer. https://doi.org/10.1007/978-90-481-9178-9_1.
Scott, M. (1997), ‘PC Analysis of Key Words — And Key Key Words’, System, 25 (2):
233–45.
Sharoff, S., Z. Wu, and K. Markert (2010), ‘The Web Library of Babel: Evaluating Genre
Collections’, in Proceedings of the Seventh International Conference on Language
Resources and Evaluation (LREC '10), Valletta, Malta: European Language Resources
Association (ELRA).
Stubbs, M., and C. Tribble (2006), Textual Patterns: Key Words and Corpus Analysis in
Language Education, Amsterdam: John Benjamins.
Tumasjan, A., T. Sprenger, P. Sandner, and I. Welpe (2010), ‘Predicting Elections with
Twitter: What 140 Characters Reveal about Political Sentiment’, in Proceedings of
the International AAAI Conference on Web and Social Media, 178–85, Menlo Park:
Association for the Advancement of Artificial Intelligence.
Zappavigna, M. (2012), Discourse of Twitter and Social Media: How We Use Language to
Create Affiliation on the Web, London and New York: Continuum.


Part I
Methods for Data Collection,
Analysis and Visualization


1
Noisy Data
Using Automatic Speech Recognition
Transcripts for Linguistic Research
Steven Coats
Introduction
Automatic speech recognition (ASR) has become a standard feature of video
streaming services and online conferencing platforms,
a development which not only contributes to improvements in human–machine
interaction systems but also offers new opportunities for the collection of
naturalistic spoken language data for the purposes of linguistic analysis. While
increased availability of this kind of data opens up new horizons by providing
access to massive amounts of naturalistic speech via the internet, the quality of
ASR transcripts can be low: they may contain word items that are not in the
original recording, the result of errors by the transcription algorithm, or can omit
discourse content for various reasons, such as algorithmic errors, speech overlap
or poor audio fidelity. For some corpus creation projects, ASR has been judged
to be of limited use: the creators of the BNC2014, for example, considered ASR
for the creation of transcripts from hundreds of hours of conversation recorded
on mobile telephones but found its quality to be insufficient and instead utilized
a team of workers to manually transcribe the audio data (McEnery 2018, 11).
Manual transcription of audio recordings, however, is a time-consuming and
expensive process, and few researchers have the resources necessary to manually
transcribe large amounts of audio data. In this respect, ASR offers significant
advantages in terms of cost-effectiveness and data accessibility. As developments
in neural network modelling continue to improve the quality of ASR systems, it
seems likely that some types of linguistic analysis will increasingly make use of
processing pipelines that include ASR-generated transcript data. The prospect
gives rise to the question: Can ‘noisy data’ (i.e. data containing errors) such as ASR
transcripts be useful for linguistic analysis, despite their errors?
This chapter provides a provisional affirmative answer to this question on the
basis of a comparison of ASR transcripts with manually uploaded transcripts
for YouTube videos. To begin, word error rates (WERs) are calculated for ASR
and manually uploaded transcripts of the same videos indexed in four recent,
publicly available corpora of YouTube ASR transcripts: CoNASE, the Corpus
of North American Spoken English (Coats 2020, 2023), CoBISE, the Corpus of
British Isles Spoken English (Coats 2022b), CoANZSE, the Corpus of Australian
and New Zealand Spoken English (Coats 2022a), and CoGS, the Corpus of
German Speech. To test the suitability of ASR data for analytical purposes, ASR
and manual transcripts are used to train a support vector machine classifier
in order to determine the provenance of particular transcripts from England
or from Scotland, and the distinctive vocabulary of the comparison samples
is identified by two methods: the most informative model feature weights
associated with individual lexical items and the log-likelihood measure
(Dunning 1993). It can be shown that WERs are relatively low, and that the
classification algorithm generates largely the same results whether the models
have been trained using ASR or manually transcribed data. A closer look at the
distinctive vocabulary of the test data shows that for the most part, the word
types identified on the basis of classifier feature weights and those with the
highest log-likelihood scores are the same. These results suggest that for some
kinds of tasks, ASR data can be a suitable proxy for manually transcribed data
for linguistic analysis purposes.
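A schematic Python illustration of the classification approach described above, training a linear support vector machine with scikit-learn and reading off its most informative feature weights; the four toy ‘transcripts’ and their labels are invented stand-ins for the England and Scotland samples, not material from the corpora themselves.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy stand-in for the England/Scotland task: two-class transcripts
texts = ["we had a wee dram last night", "it was a braw bricht evening",
         "we took the tube into town", "a lovely cuppa in the cafe"]
labels = ["Scotland", "Scotland", "England", "England"]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LinearSVC().fit(X, labels)

# The largest |coefficients| point to the most distinctive word types;
# positive weights favour 'Scotland', negative weights 'England'.
weights = sorted(zip(clf.coef_[0], vec.get_feature_names_out()))
print("most England-like:", weights[:3])
print("most Scotland-like:", weights[-3:])
```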
The chapter is organized as follows: In the section ‘Background: ASR’,
some background on ASR technology and the use of ASR transcripts in
linguistic research is provided, and the section ‘Comparison of ASR and
manual transcripts’ describes the procedure used to retrieve manual and ASR
transcripts from YouTube in order to calculate the WER of the ASR transcripts,
then presents the WER results. The section ‘Classification model using ASR and
manual transcripts’ compares the ASR and the manually uploaded transcripts
as training data for determining whether a given transcript is from England or
from Scotland and further discusses the word types that are identified as most
distinctive using two methods: model weights in the classifier and the log-
likelihood score. The ensuing discussion section contextualizes these results,
especially in light of several important methodological and analytical caveats
pertaining to the nature of the corpora, and the summary section presents a
brief outlook for future work along these and similar lines.

Background: ASR
ASR systems have been in use for more than a half century, but only relatively
recently have they begun to achieve high accuracy levels on naturalistic audio
data, a development due in part to the use of increasingly sophisticated deep
neural network architectures and very large language models (Wang et al. 2019).
This section briefly notes some recent advances in ASR technology, identifies
some issues which can affect the accuracy of ASR algorithms, and discusses
linguistics research endeavours that have made use of ASR transcripts. YouTube’s
Application Programming Interfaces (APIs) and methods for accessing them
are discussed, as well as the reuse of data that may be copyrighted in linguistic
corpora for research purposes.
ASR accuracy
ASR is an important topic in computer science, and significant advances have
been made in recent years by utilizing neural network architectures (Liao et al.
2013; Sainath et al. 2015) and transformer models, with some systems reporting
accuracy comparable to that of human transcribers in word error rate (i.e. the
proportion of words incorrectly transcribed in a given audio file) (Amodei
et al. 2016; Xiong et al. 2017). WERs of 5–6 per cent, comparable to those of
human transcribers, have been reported for ASR algorithms for tasks such as
the transcription of utterances for voice search services (Chiu et al. 2018), and
recent transformer-based approaches have reported WERs as low as 1.8 per cent
for spoken texts recorded by a single speaker under laboratory conditions, even
when using limited amounts of labelled training data (Baevski et al. 2020).
Naturalistic conversation, however, is more difficult to automatically
transcribe, with automatic transcription systems exhibiting WERs in the range
of 20–50 per cent (Kim et al. 2019; Koenecke et al. 2020). Higher WERs for
naturalistic conversation can result from low recording quality as well as speech
signal properties related to individual variation such as speaker fluency, use
of non-standard words, accent, speech rate, pitch and other prosodic features
(Aksënova et al. 2021).
ASR accuracy for regional varieties of English can be lower if systems have
not been trained with appropriate data or use language models that do not
correspond to the type of speech being transcribed. Higher error rates have
been reported for ASR transcripts of Scottish English/Scots or Indian English,
compared to Southern United Kingdom or American English, possibly because
the systems were trained primarily with data from Southern United Kingdom
and standard American speakers (Tatman 2017; Markl et al. 2021; Meyer et al.
2020).
There have been relatively few attempts to use ASR transcripts for corpus
building specifically for the purposes of linguistic analysis. Scherrer, Glaser
and Samardžić (2019) described creating a corpus of Swiss German from
approximately 48 hours of audio and manual transcriptions, and Nigmatulina
et al. (2020) presented the results of an ASR system designed to automatically
transcribe the same data. They reported modest accuracy levels, with F1
scores between 0.3 and 0.5, but noted that model accuracy may be affected by
the inherent inconsistency of the underlying manual transcriptions: for Swiss
German, there is not a universally accepted transcription norm.
Coto-Solano et al. (2021) compared the ASR transcription systems
DeepSpeech (Hannun et al. 2014) and CMU Sphinx (Lamere et al. 2003) for
a task in which a script pipeline was used to extract target words from audio
files, and then to generate values for vowel formants of North American English.
Although the study was not per se focused on transcript accuracy – the pipeline
was used to automatically identify word items from which targeted vowels could
be extracted using forced alignment algorithms – it found that the DeepSpeech
system generated a vowel space that was more similar to a gold standard based
on manual transcripts, compared to the CMU Sphinx system. The authors
pointed out that ‘errors in the transcription may not affect the overall goal of
producing vowel formants that are generally representative of a speaker’s dialect
features’ (Coto-Solano et al. 2021, 3).
As of 2023, ASR systems utilizing state-of-the-art self-supervised
transformer models with millions or billions of parameters can achieve low
WERs for multilingual transcription tasks even with limited amounts of
training data (Babu et al. 2021), and similar approaches may be effective for
automatic transcription of accented and dialectal speech (Shi et al. 2021). The
trend towards ever-improving ASR quality will undoubtedly make this kind
of data more useful for linguistic analysis, but, as Coto-Solano et al. point out,
even data containing errors can provide an analytical basis for valid linguistic
inferences. Data processing and analysis techniques have been developed to deal
with ‘noisiness’, including erroneous transcriptions or missing textual content.
For machine-learning classification tasks, for example, artificially introducing
errors into 30 per cent of the word tokens in a text will not significantly affect
classifier accuracy (Agarwal et al. 2007), because given enough input data, the
preponderance of accurate tokens for a given word type will outweigh the effects
of the errors. Considerations along these lines suggest that ASR transcripts can
be useful for corpus-based linguistic analysis.
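
As a rough illustration of what such noise injection looks like, the following sketch (an invented example, not Agarwal et al.'s actual procedure) corrupts a fixed share of word tokens; classifier accuracy can then be compared on clean and corrupted versions of the same texts:

```python
# An illustrative sketch of the kind of noise injection used in such
# experiments (cf. Agarwal et al. 2007): corrupt a share of word tokens,
# then compare classifier accuracy on clean versus corrupted texts.
import random

def corrupt(text: str, rate: float = 0.3, seed: int = 0) -> str:
    """Replace approximately `rate` of the tokens with an error token."""
    rng = random.Random(seed)
    tokens = text.split()
    for i in rng.sample(range(len(tokens)), int(len(tokens) * rate)):
        tokens[i] = "<err>"
    return " ".join(tokens)

print(corrupt("the council approved the budget for the new library", rate=0.3))
```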
YouTube ASR data
ASR transcripts were first introduced for some videos on YouTube beginning
in 2009 (Google 2009). Through its APIs, YouTube provides ASR transcripts
(known as ‘captions’) in six formats, as of early 2023: TTML (Timed Text Markup
Language, a W3C standard that is basically a variant of XML[1]), the custom formats
SRV1, SRV2 and SRV3 (which are derived from TTML but differ in terms of tag
and attribute structure and inclusion of timing information), WebVTT (Web
Video Text Tracks, a newer W3C standard[2]) and JSON3, based on the JSON data
serialization format. YouTube ASR transcripts are not ‘diarized’, that is, they make
no use of labels or other textual devices to indicate a change in speaker.
ASR transcripts, as well as comments, likes, user information, and other
metadata associated with videos and users on the platform can be accessed via
YouTube’s official API. This API, however, utilizes a quota system in which users
are provided daily access to a limited amount of data. In practice, the official API
at default access level is unsuitable for large-scale data collection.[3] In addition to
the official API, YouTube maintains an undocumented API (sometimes referred
to as the ‘innertube’) which is used to provide access via the YouTube website
to videos, captions, comments and other data. When a user accesses a YouTube
video, a cryptographic key is automatically generated for the requesting IP
address; video streaming content, transcripts and other data are then delivered
via temporary URLs to the IP address. This feature of YouTube’s functionality
makes it possible to collect video and transcript (and other) data using scripts
and open-source software. A number of open-source scripting libraries have
been developed to access YouTube content via the ‘innertube’. The data for this
study was collected using the popular open-source youtube-dl library as well
as yt-dlp, a fork of youtube-dl with some additional features.[4]
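
As an illustration, the following minimal sketch uses the yt_dlp Python package (rather than the command-line interface) to retrieve both the ASR and the manually uploaded English captions for a single video, without downloading the media stream; the video ID is a hypothetical placeholder:

```python
# A minimal sketch using the yt_dlp Python package; 'VIDEO_ID' is a
# hypothetical placeholder, and option names follow yt-dlp's embedded API.
import yt_dlp

options = {
    "skip_download": True,       # retrieve captions only, not the media stream
    "writesubtitles": True,      # manually uploaded captions, if available
    "writeautomaticsub": True,   # YouTube's ASR ('automatic') captions
    "subtitleslangs": ["en"],
    "subtitlesformat": "ttml",   # one of the six formats discussed above
}

with yt_dlp.YoutubeDL(options) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])
```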
Regarding the legal framework for the utilization of data collected from
YouTube or other commercial platforms for the creation of research resources
such as corpora, legal doctrines vary according to jurisdiction and usage policies
of online platforms continue to evolve. A number of jurisdictions, including
the United States, Canada, the United Kingdom, Australia, New Zealand
and Germany, permit the reuse of copyrighted material for non-commercial
educational and research purposes (Fiil-Flynn et al. 2022). In the United
States, the home jurisdiction of YouTube, ‘Fair Use’ provisions of copyright law
generally permit data to be harvested and reused for purposes such as corpus
creation. Factors which govern the determination if a particular case is Fair
Use in US law include ‘the purpose and character of the use, including whether
such use is of a commercial nature or is for nonprofit educational purposes, the
nature of the copyrighted work, the amount and substantiality of the portion
used in relation to the copyrighted work as a whole; and the effect of the use
upon the potential market for or value of the copyrighted work’.[5] Corpora
such as the ones used in this study, which were created for nonprofit purposes
from data uploaded by government institutions, and which in total comprise
an infinitesimal proportion of the total amount of data available on YouTube,
have no effect on the market for YouTube content, and thus qualify as fair
use according to these factors. European and German law suggest that data
of the sort collected in these corpora can be shared for research purposes as
corpora. Directive 2003/98/EC of the European Parliament establishes a legal
basis for reuse of public sector information: ‘re-use means the use by persons
or legal entities of documents held by public sector bodies, for commercial
or non-commercial purposes other than the initial purpose within the public
task for which the documents were produced’; a document is defined as ‘any
content whatever its medium (written on paper or stored in electronic form or
as a sound, visual or audiovisual recording)’, or ‘any part of such content’.[6] For
Germany, § 60d of the Gesetz zur Angleichung des Urheberrechts an die aktuellen
Erfordernisse der Wissensgesellschaft (‘law on the adjustment of copyright to
the needs of knowledge-based society’) permits data collection of copyrighted
material for non-commercial research purposes such as the creation of corpora
(UrhWissG 2017).
As large data sets become more important as inputs in all domains of human
society, changes in the legal framework for data collection will likely develop
in the direction of providing more access to data, rather than less. Publicly
available resources created from web scraping have received institutional or
governmental grants.[7] Personally identifiable information of the kind subject
to specific legal protections under the General Data Protection Regulation
(GDPR) of the EU is generally not recorded in content uploaded by local
government websites, and individual transcript files have no personal or
demographic metadata pertaining to speakers. These considerations suggest
that careful reuse of ASR transcript data from YouTube for linguistic or
scientific research is a legitimate use case, and that similar techniques can be
applied to create resources targeting specific language phenomena or discourse
contents.

Comparison of ASR and Manual Transcripts
This section describes the procedures used to evaluate ASR accuracy for four
different sets of YouTube transcripts by comparing the ASR transcripts with
manually uploaded transcripts for the same videos. The comparison is based on
transcripts indexed in four recent, publicly available corpora of YouTube ASR
transcripts from North America, the British Isles, Australia and New Zealand
and Germany (Coats 2019, 2022a, 2022b, 2023, in review). These corpora have
been designed along the same lines and have been created using similar methods:
they contain ASR transcripts of videos from channels of local governments or
councils, representing a diverse range of video types, speakers, and interaction
and recording contexts. Many of the transcripts in the four corpora (with the
exception of CoGS) are from meetings of various types.
Method
For each of the four corpora, the video IDs of all of the ASR transcripts in the
corpus were used as inputs in a script that checked, for each video, if both ASR
and manual captions were available; if so, both were downloaded in the same
data format.[8] Relatively few of the videos whose ASR transcripts comprise the
four corpora used in the study also had manual captions (Table 1.1).

Table 1.1 Number of ASR and Manual Transcripts by Corpus

           ASR Transcripts   Manual Transcripts   Percentage with both
CoNASE     301,846           3,982                1.3
CoBISE     38,680            2,686                6.9
CoANZSE    56,815            4,694                8.3
CoGS       39,495            2,408                6.1
Australian and New Zealand videos were the most likely to have manual
captions, followed by British Isles and German videos. North American videos
were least likely to have manual transcripts, with 1.3 per cent of the video IDs
recorded in CoNASE having a manual transcript file available, a fact that may
reflect cost and convenience factors.
Inspection of the manual transcripts revealed that some of the downloaded files
were empty, with no textual content (e.g. approximately 11 per cent of those from
CoNASE), presumably due to human error.[9] In addition, some manual transcripts
were translations from English into various other languages, presumably uploaded
by councils and local governments to make the content of English-language
videos accessible to non-English-speaking viewers.[10] These were removed from
consideration by applying spaCy’s automatic language detection module to the
transcripts and removing pairs containing a non-English file.
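
A hedged sketch of this filtering step is given below; the chapter applies spaCy's language detection, but the standalone langdetect package stands in here for illustration, as the logic is the same with any detector:

```python
# An illustrative sketch of the language filter; the standalone langdetect
# package stands in here for the spaCy-based detection used in the study.
from langdetect import detect, LangDetectException

def keep_pair(asr_text: str, manual_text: str) -> bool:
    """Keep a transcript pair only if both files are detected as English."""
    try:
        return detect(asr_text) == "en" and detect(manual_text) == "en"
    except LangDetectException:  # empty or non-linguistic input
        return False
```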
In order to calculate WER values, it was necessary to convert the raw transcript
data for the ASR and manual transcripts, retrieved in the TTML format, to
sequences of words representing speech, comparable in terms of XML structure,
punctuation, capitalization, line spacing, diarization and representation of
numerical values. While the TTML files of YouTube ASR transcripts have a
standardized and predictable format, the manual transcript files can exhibit
variation for these features. Depending on the practices of the manual transcriber
(or the ASR system whose output has been uploaded as a manual transcript file;
see below), diarization in manual transcripts may consist of personal names
followed by a colon, arrows or greater-than symbols (e.g., ‘>>’), dashes or other
marks. Manual transcripts can have inconsistent line-break characters, unescaped
HTML entities and/or encoding errors, and non-standard punctuation. In
addition, both manual and ASR transcripts can contain non-speech textual
content, for example, to indicate music, applause or laughter. Inconsistency in the
representation of numerical values can also contribute to higher WERs: numbers,
components of street addresses, and times can be represented in transcripts as
words, numerals or a combination thereof (e.g. 3 or three, 5th or fifth, and 12:30
or twelve thirty). Figure 1.1 shows two (invented) transcript excerpts that could
be considered equivalent in terms of underlying speech content, but whose
transcripts, without processing, would result in a high WER.

Figure 1.1 Theoretically equivalent transcripts with high WER.

To take into account these factors, a script was devised to normalize the
ASR and manual transcripts to be as consistent as possible in terms of how the
underlying speech signal is represented. The script comprised the following
steps to generate comparable word sequences (a simplified sketch follows the list):
1. Ensure file encoding is UTF-8; convert if necessary
2. Normalize HTML entities (e.g. convert &amp; to &)
3. Retrieve text from all HTML paragraph (<p>) elements
4. Convert all numerals to word forms[11]
5. If the string contains one or two words followed by a colon (possible
diarization-speaker indication), remove these and the colon
6. Convert to lowercase
7. Remove all text within square brackets (in ASR transcripts and in some
manual transcripts used to indicate non-speech content such as music)
8. Remove punctuation
9. Remove any remaining non-word characters
10. Reduce all sequences of space characters to a single space
11. Filter out any videos for which the automatically detected ASR or manual
transcript language is not English
12. Calculate WER
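
A simplified sketch of steps 2 and 4 to 10, assuming UTF-8 input (step 1) with the text already extracted from the <p> elements of a TTML file (step 3), might look as follows; the regular expressions are illustrative approximations of the procedure, not the script actually used for the corpora:

```python
# An illustrative approximation of normalization steps 2 and 4-10; assumes
# UTF-8 input (step 1) with <p> text already extracted (step 3).
import html
import re
from num2words import num2words

def normalize(text: str) -> str:
    text = html.unescape(text)                                         # step 2
    text = re.sub(r"\d+", lambda m: num2words(int(m.group(0))), text)  # step 4
    text = re.sub(r"^\s*\S+(\s+\S+)?:", "", text, flags=re.M)          # step 5
    text = text.lower()                                                # step 6
    text = re.sub(r"\[[^\]]*\]", "", text)                             # step 7
    text = re.sub(r"[^\w\s]|_", " ", text)                             # steps 8-9
    return re.sub(r"\s+", " ", text).strip()                           # step 10
```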
Some manual transcripts did not include the complete speech content of a video,
but only a portion of it.[12] For these videos, the ASR transcripts were typically
much longer than the corresponding manual transcripts, resulting in WER
values much greater than 1. These were filtered by removing all transcript pairs
with a WER value above 0.9.[13]
In some cases, ASR system errors by YouTube or by third-party services may result
in erroneous transcripts. For example, the manual transcript for the video
https://www.youtube.com/watch?v=UYccMJqwBTY, from Hendersonville, Tennessee,
contains the single word d?d?[ód?d??d?d?qd?Ñ. Although it is impossible to know
exactly how this transcript was generated, it may be the case that a third-party ASR
algorithm misidentified an excerpt from the audio signal as a language normally
written with a non-Latin script; the UTF-8 bytes used to render the fragmentary and
short erroneous ASR transcript may have then been converted into Windows-1252 or
another encoding scheme, resulting in a meaningless sequence of error characters.
Videos with erroneous manual transcripts like this were automatically removed by
filtering out those with WER values greater than 0.9, as described earlier.
This basic text processing procedure, although not perfect, was mostly able
to create comparable transcripts containing text representations of speech in the
corresponding videos.
Word error rate, a standard for measuring the accuracy of ASR systems, was
used to compare the cleaned ASR and manual transcripts. WER is calculated
according to the formula

WER = (S + D + I) / N

where S represents the number of substitutions, D deletions and I insertions
necessary to transform the text whose accuracy is to be tested (the hypothesis
text, here the ASR transcript) to a ground-truth text (here the manually uploaded
transcript); N is the number of words in the ground-truth text. WER can range
from 0 (texts are identical) to 1 (texts have zero overlap), assuming the two input
texts have the same number of word tokens. If the hypothesis text is longer than
the ground-truth text, WER values can be greater than 1. As described earlier,
in the test data, some ASR files were significantly longer than the corresponding
manually uploaded files for the same videos, mostly because the manually
uploaded files only contained a partial representation of the speech in the
corresponding videos. These transcript pairs were filtered after calculation of
WER, which was implemented using the jiwer library in Python (Vaessen 2021).
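
The calculation itself is a one-liner; the following minimal sketch, with invented transcript strings, shows the jiwer call used to compare a cleaned pair:

```python
# A minimal sketch of the WER calculation with jiwer; the two strings are
# invented examples, with the manual transcript as the ground truth.
import jiwer

manual = "the motion carries five votes to two"   # hypothetical ground truth
asr = "the motion carries five boats to two"      # hypothetical ASR output

print(jiwer.wer(manual, asr))  # (S + D + I) / N = 1/7, approx. 0.14
```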
WER values
Mean and median WER rates after the normalization and filtering of the input
files are shown in Table 1.2; the distributions of WER values are shown in Figure 1.2.

Table 1.2 Mean and Median WERs by Corpus

          Mean   Median
CoNASE    0.15   0.12
CoBISE    0.16   0.12
CoANZSE   0.14   0.11
CoGS      0.17   0.13

Figure 1.2 Distribution of WER values for videos from four corpora.
As can be seen in Figure 1.2, the distributions of WER values are quite
similar for the test samples from North America, the British Isles, Australia,
New Zealand and Germany. To some extent, the similarity in values reflects the
fact that the underlying ASR-manual transcript pairs have been subject to the
same processing and filtering steps, which were designed to remove transcripts
in which the underlying audio signal did not match with the transcript content.
The similarity in WER mean and median values for the four sampled corpora as
well as the shapes of the distributions also function as a kind of ‘sanity check’: If
the WER values or distributions were drastically different for different corpora, it
might indicate there has been some error in processing the data. The underlying
ASR system components are the same (although the acoustic and language
models may differ for the different languages and varieties), so the fact that the
values and distributions are largely similar for four large samples of naturalistic
data is something to be expected.
Classification model using ASR and manual transcripts
In order to validate the hypothesis that ASR transcripts can be useful for
analytical and interpretative purposes in linguistic analysis, despite transcript
errors, simple machine-learning models were created using scikit-learn (Pedregosa et
al. 2011). The task of the models was to correctly assign texts from CoBISE to
England or Scotland, based on lexical features.
To set up the models, individual transcripts were converted to term frequency-
inverse document frequency (tf-idf) matrices (Manning et al. 2008),[14] which were
then used to train a linear support vector machine (SVM; Joachims 1998) with 80
per cent of the English and Scottish transcripts (1,399 English and 399 Scottish
transcripts), using parameters optimized with the GridSearchCV method in
scikit-learn. Then, country labels for transcripts were predicted by the models for the
remaining 20 per cent of the data (341 English and 159 Scottish transcripts).
The first model was trained with manual transcripts, and the second model with
ASR transcripts. Model accuracy is summarized in Table 1.3, and the confusion
matrices for the models are shown in Figure 1.3.

Table 1.3 Summary of Classification Results

            Model 1 (manual)           Model 2 (ASR)              Ground-truth labels
            Precision  Recall  F1      Precision  Recall  F1      (test data)
England     0.91       0.99    0.95    0.91       0.99    0.95    341
Scotland    0.98       0.80    0.88    0.98       0.78    0.87    159
Accuracy    0.93                       0.93

Figure 1.3 Confusion matrices for classifiers trained with manual transcripts (left)
and ASR transcripts (right).
As can be seen, the results are nearly identical whether manual transcripts
or ASR transcripts are used: for purposes of a simple predictive binary
classification, ASR transcripts seem to be a good proxy for manual transcripts.
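
A minimal sketch of this setup is given below; `texts` and `labels` are assumed to be parallel lists of transcript strings and country labels, the vectorizer settings follow note 14, and the grid search over further parameters is omitted for brevity:

```python
# An illustrative sketch of the tf-idf + linear SVM classifier; `texts` and
# `labels` are assumed inputs, and hyperparameter tuning (GridSearchCV in
# the study) is omitted here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

clf = make_pipeline(
    TfidfVectorizer(min_df=0.1, max_df=0.9, stop_words="english"),  # cf. note 14
    LinearSVC(),
)
clf.fit(X_train, y_train)
print(f"Accuracy: {clf.score(X_test, y_test):.2f}")
```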
Additional insight into the differences between the two models can be
gained by examining the feature weights: for these models, the lexical items
that contribute the most to the categorization of a given transcript as English or
Scottish. In Figure 1.4, the most important features are shown for the manual
transcript model (above) and the ASR transcript model (below).

Figure 1.4 Most important features in the models trained with manual transcripts
(above) and ASR transcripts (below).
The features with positive weights (black bars) contribute to a classification
of a given text as Scottish, while features with negative weights (gray bars) tend
to result in a classification of a text as English. As is expected, toponyms and
toponym-derived adjectives such as Scotland, Aberdeenshire, Scottish, Edinburgh,
Northumberland, Kinross, Glasgow, Perth, Cornwall, Ayrshire, Gateshead,
Aberdeen and Grampian are among the most important distinguishing features
in the models. In the manual transcript model, wee, a Scottish word that may not
be accurately rendered in ASR transcripts, is important, as is the initialism SEPA
(the Scottish Environmental Protection Agency), an organization whose channel
is the source of some of the videos in the dataset. There are 147 occurrences of
wee in the manual transcripts used for the classification task, compared to 57
in the ASR transcripts, and 78 occurrences of SEPA in the manual transcripts,
compared to 14 in the ASR transcripts. Other words with strong feature weights
reflect the channel composition of the English and Scottish CoBISE subcorpora:
painting, paintings, panel and picture reflect the inclusion of transcripts from the
Scottish National Museum in the dataset.
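
Reading the feature weights off a fitted linear SVM is straightforward; a hedged sketch, continuing from the pipeline sketched above, follows. For a binary LinearSVC, `coef_[0]` holds one weight per vocabulary item, with positive values pointing towards the alphabetically later class (here assumed to be Scotland):

```python
# An illustrative sketch of extracting the most informative features from the
# fitted pipeline sketched earlier; step names follow make_pipeline's defaults.
import numpy as np

vocab = np.array(clf.named_steps["tfidfvectorizer"].get_feature_names_out())
weights = clf.named_steps["linearsvc"].coef_[0]
order = np.argsort(weights)

print("Strongest 'England' features:", vocab[order[:10]])    # most negative
print("Strongest 'Scotland' features:", vocab[order[-10:]])  # most positive
```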
While predictive classification is a common task in natural language processing,
traditional corpus-linguistic analysis more often uses feature frequencies to
characterize language varieties in terms of social-demographic, regional or
situational variables. The log-likelihood measure, or G score, widely used in corpus
linguistics for the comparison of feature frequencies (Dunning 1993; Rayson and
Garside 2000), quantifies difference in relative frequency between two corpora
for a given feature such as a word type or grammatical construction. Examining
the items with the highest G scores, and comparing them with the items with the
highest model weights in the SVM classifier, can help to demonstrate similarities
and differences between the predictive model-based approach and more
traditional corpus analysis techniques (cf. Laippala et al. 2021).
Figure 1.5 shows the twenty-five types with the highest G scores for use in
English and Scottish transcripts for the manual transcripts (above) and the ASR
transcripts (below).

Figure 1.5 Word types with highest G scores for manual transcripts (above) and
ASR transcripts (below).
As can be seen, most of the types with the highest G scores are the same for
the manual and the ASR transcripts: at least for lexical types, ASR transcript
frequencies, despite errors, can be considered a good proxy for manual transcript
frequencies using traditional corpus-linguistic quantitative measures such as the
G score. Some of the types with the highest G scores are also among the features
with the greatest weights in the predictive models, mostly toponyms. However,
the other items among the types with high G score values are not attested among
the top features in the predictive models, likely due to the conversion from
frequencies to tf-idf values and the different mathematical operations underlying
the calculation of feature weights for support vector machines, compared to the
calculation of G scores.
Discussion and caveats
The analysis demonstrates that ASR transcripts are comparable to manual
transcripts, both in terms of their performance on a classification task and
the individual features (word types) recognized as important, whether using a
support vector machine algorithm or a traditional log-likelihood measure.
The WER values and classifier results suggest that ASR transcripts may be
useful for certain types of linguistic analysis and prediction, despite the presence
of errors. However, several caveats pertaining to the nature of the manual
transcript data analyzed in this study must be noted.
The first important consideration is the fact that many of the files uploaded
as manual transcripts to YouTube have possibly not been generated by a human
transcriber, but rather by a non-YouTube proprietary ASR algorithm. Some
municipalities, particularly in the United States and Canada, use the services
of companies to manage the organization, file upload and online accessibility
of records of council meetings, including ASR transcription of videos.[15] These
municipalities, mostly larger cities, may host video and caption content on
their own dedicated websites, as well as uploading videos and transcripts to
their YouTube channels. For municipalities using these services, captions files
manually uploaded to YouTube may be ASR captions files generated by the
proprietary or licensed algorithms of the commercial content management
service that has been contracted by the locality. As is the case with YouTube,
the language models and parameter settings used for ASR captioning by these
commercial services are ‘black boxes’, and not publicly available. In addition, and
unlike YouTube ASR transcripts, it can be difficult to determine if a particular
manual captions file for a YouTube video has been created by a human or generated
by an ASR algorithm.[16]
This means that the WER values calculated for this study cannot be said to
represent human transcription versus YouTube’s ASR algorithm. Rather, the
values may correspond to a comparison of an aggregate of human transcribers
and non-YouTube third-party ASR algorithms with YouTube’s own ASR
algorithm. While this is not the same as directly comparing human-created
manual transcripts with ASR transcripts, the interpretation of the WER values
is the same: lower WERs indicate higher-quality ASR transcripts. Different ASR
algorithms can generate transcripts that are not identical, just as different human
annotators can produce different transcripts. Overall and in aggregate, lower
WER values will likely correspond to more accurate transcripts.
Manually uploaded transcripts can be problematic even if the acoustic segments
they cover have been faithfully transcribed. As noted earlier, for some manual transcripts,
the audio signal of the recording has not been completely transcribed: some
speech, typically at the beginning or the end of the audio, has been left out,
perhaps because it includes content not pertaining to one of the topics on the
council’s agenda for that day or is otherwise not considered important for the
business of the council. While the WER cutoff filtering procedure described
in the section ‘Comparison of ASR and manual transcripts’ removes many of
these videos due to the fact that they have fewer words than the corresponding
ASR transcripts, it cannot identify cases in which, for example, the manual
transcript has been cut off, but the ASR transcript does not contain the entire
speech content of the video due to poor audio quality or ASR errors. Such
transcript pairs could have comparable length in terms of number of tokens,
but would exhibit artificially high WERs, as the two transcripts do not record
the same content. It is not known if cases such as this are part of the data for this
comparison, as none were detected manually.
Other sources of error in the data in this analysis are theoretically possible.
For example, some videos may have inaccurate manually uploaded transcripts
because the human transcriber uploaded the wrong transcript file, but the
file is comparable in length to the ASR transcript. Keyboarding errors during
the preparation of manual transcripts resulting from copy-pasting or other
manipulations are also conceivable. It is impossible to estimate the extent of
these phenomena in the data, but such transcripts are likely to have been filtered
out by the procedures described earlier.
A more general consideration pertaining to the usefulness of ASR transcripts
in linguistic research has to do with the lack of detailed metadata about
speakers. ASR transcripts do not contain any information about demographic
characteristics of speakers. YouTube ASR transcripts also lack diarization. As
such, ASR transcript files may not be suitable for some types of sociolinguistic
analysis without further manual annotation. Nonetheless, they can provide
reasonably accurate representations of the speech content of a video file.
It should also be remarked that the question of whether there are linguistic
differences in the samples considered in the classification experiment described
in this chapter is not per se addressed. It is known that toponyms and other
place-specific lexical items serve to distinguish regional language varieties even
in the absence of grammatical and syntactic variation (cf. Eisenstein et al. 2014;
Dunn 2019). In the context of a linguistic analysis of lexical variation, such items
should be excluded.
Ultimately, without manually examining every manual transcript file in the
data set, it is impossible to provide a complete description of all possible sources
of error. Nevertheless, the fact that the classification experiment returned nearly
identical results for manual and ASR transcripts, and the fact that classifier
feature weights were interpretable in a straightforward manner, suggests that
ASR transcripts may be a reliable proxy for manual transcripts for certain types
of analysis and/or prediction.
Summary and outlook
This chapter has explored the possibility of harnessing the vast amounts of
automatically generated transcripts from streaming platforms such as YouTube
for the purposes of linguistic research. Specifically, the chapter has focused on
the accuracy of ASR transcripts. By retrieving ASR and manual transcripts for
the same videos, WER scores could be calculated; these were found to be in the
range of 14–17 per cent for videos in which transcripts were filtered to ensure
equivalence. While these WER rates are higher than those of conversational
speech transcribed by human annotators, which have been reported to be in the
range of 4–11 per cent (Lippmann 1997; Xiong et al. 2017), the preponderance
of correctly transcribed word forms in ASR transcripts makes them useful for
aggregate analysis approaches based on large data sets. Predictive classifiers
in machine-learning models, for example, are based on aggregate feature
frequencies and are thus relatively little affected by transcript errors, as the
analysis in the section ‘Classification model using ASR and manual transcripts’
has shown. This kind of approach may help to mitigate the effects of errors in
the data, as the stronger signal of accurately transcribed lexical items in ‘noisy
data’ will result in the same lexical items being identified in both manual and
ASR transcripts as feature weights in a support vector machine model and as
keywords using a log-likelihood calculation.
The widespread availability of ASR transcript data at present and continual
refinements in ASR models point towards increased utilization in the future
of this kind of data for purposes of empirical research in linguistics, especially
in fields in which large-scale data collection would require extensive manual
resources. The utilization of this kind of ‘noisy data’, while not without challenges,
opens up the possibility of new research perspectives into contemporary
language use.
Notes
1 https://www.w3.org/TR/ttml1/
2 https://www.w3.org/TR/webvtt/
3 As of early 2023, YouTube’s default quota provides users with 10,000 quota units per
day, but listing the available captions files for a single video has a quota cost of 50
units, and downloading a single transcript costs 200 units, so at most about 40
transcripts (10,000 / 250 units) can be retrieved per day.
4 https://youtube-dl.org, https://github.com/yt-dlp/yt-dlp.
5 U.S. Code Title 17, § 107.
6 https://www.enisa.europa.eu/topics/risk-management/current-risk/laws-regulation/e-business/directive-2003-98-ec.
7 For example, the corpora at https://english-corpora.org, or the web corpora
accessible via the Berlin-Brandenburg Academy of Sciences at https://dwds.de.
8 A somewhat similar approach has been taken in the https://github.com/2dot71mily/youtube_captions_corrections
project, in which ASR transcripts and manually keyboarded captions are
downloaded via APIs, then the difflib library is used to highlight differences.
9 Many American municipalities utilize the services of private companies to
manage YouTube accounts and other social media presences; it is possible that the
uploading pipeline for transcripts was incorrectly configured for these videos.
10 Two examples are https://www.youtube.com/watch?v=xDFxsxGGmsc, for which
the manual captions are a French translation of the English speech of the video, or
https://www.youtube.com/watch?v=Ed9arIYmucI, in which the manual captions
are in Turkish.
11 Using the num2words library in Python.
12 For example, the manual transcript for a meeting of the city council of Malibu,
California (https://www.youtube.com/watch?v=wZ4z2NhxUGQ) begins
approximately 5 minutes after the video (and the audio signal) begins. For a meeting
of Metro Nashville, Tennessee (https://www.youtube.com/watch?v=Ed9arIYmucI), the
manual transcript begins 1 minute 50 seconds after speech in the video commences.
13 Because the denominator in the WER formula is the length of the ‘ground truth’ text
in number of words, with a minimum value of 1, and the manual transcripts were
considered to be the ground-truth texts, WERs for videos in which ASR transcripts
are longer than the manual transcripts can theoretically have a value greater than 1.
The WER cutoff value of 0.9 was determined on the basis of manual inspection of
randomly chosen videos: for those with complete or nearly complete manual and
ASR transcripts, no WER value > 0.7 was found. The cutoff value of 0.9 provides
additional leeway: it was chosen in order to retain all ASR-manual transcript pairs
that can be reasonably considered to represent the same speech content, without
including pairs which may not represent the same underlying speech content.
14 With minimum and maximum document frequency values of 0.1 < x < 0.9 and
after removal of common English stopwords.
15 In North America, Granicus is a major provider of these services. Other
municipalities, such as Dallas, Texas, for example, use the services of the
commercial company Swagit.
16 Some ASR-generated manual captions are recognizable due to characteristic text
formatting features: For example, one commercial service generates transcripts in
which the individual ‘words’ are two-character sequences. The visual effect of this
transcript formatting is that the individual words ‘unroll’ on the screen when the
corresponding video is viewed (e.g. aNeZBwGyFkk). This kind of transcript, which
would have very little overlap with a transcript composed of standard lexical items,
would typically exhibit a WER > 0.9. In this study, such transcripts have mostly
been filtered out by the processing steps described earlier.
References
Agarwal, S., S. Godbole, D. Punjani, and S. Roy (2007), ‘How Much Noise Is Too Much:
A Study in Automatic Text Classification’, in G. Jagannathan and R. N. Wright (eds),
Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM
2007), 3–12. https://doi.org/10.1109/ICDM.2007.21.
Aksënova, A., D. van Esch, J. Flynn, and P. Golik (2021), ‘How Might We Create
Better Benchmarks for Speech Recognition?’, in Proceedings of the 1st Workshop
on Benchmarking: Past, Present and Future, 22–34, Stroudsburg: Association for
Computational Linguistics. https://doi.org/10.18653/v1/2021.bppf-1.4.
Amodei, D., S. Anthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper,
B. Catanzaro, Q. Cheng, G. Chen, J. Chen, J. Chen, Z. Chen, M. Chrzanowski, A.
Coates, G. Diamos, K. Ding, N. Du, E. Elsen, J. Engel, W. Fang, L. Fan, C. Fougner,
L. Gao, C. Gong, A. Hannun, T. Han, L. Vaino Johannes, B. Jiang, C. Ju, B. Jun, P.
LeGresley, L. Lin, J. Liu, Y. Liu, W. Li, X. Li, D. Ma, S. Narang, A. Ng, S. Ozair, Y.
Peng, R. Prenger, S. Qian, Z. Quan, J. Raiman, V. Rao, S. Satheesh, D. Seetapun, S.
Sengupta, K. Srinet, A. Sriram, H. Tang, L. Tang, C. Wang, J. Wang, K. Wang, Y.
Wang, Z. Wang, Z. Wang, S. Wu, L. Wei, B. Xiao, W. Xie, Y. Xie, D. Yogatama, B.
Yuan, J. Zhan, and Z. Zhu (2016), ‘Deep Speech 2: End-to-end Speech Recognition
in English and Mandarin’, in M. Balcan and K. Weinberger (eds), Proceedings of the
33rd International Conference on Machine Learning, Proceedings of Machine Learning
Research Vol. 48, 173–82, New York: Institute of Electrical and Electronics Engineers.
Babu, A., C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y.
Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli (2021), ‘XLS-R: Self-supervised
Cross-lingual Speech Representation Learning at Scale’, arXiv:2111.09296v3 [cs.CL].
https://arxiv.org/abs/2111.09296v3.
Baevski, A., H. Zhou, A. Mohamed, and M. Auli (2020), ‘wav2vec 2.0: A Framework for
Self-supervised Learning of Speech Representations’, arXiv:2006.11477v3 [cs.CL].
https://arxiv.org/abs/2006.11477v3.
Chiu, C.-C., T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan,
R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani
(2018), ‘State-of-the-art Speech Recognition with Sequence-to-sequence Models’,
arXiv:1712.01769v6 [cs.CL]. https://arxiv.org/abs/1712.01769v6.
Coats, S. (2019), ‘A Corpus of Regional American Language from YouTube’, in C.
Navarretta et al. (eds), Proceedings of the 4th Digital Humanities in the Nordic
Countries Conference, Copenhagen, Denmark, March 6–8, 2019, 79–91, Aachen:
CEUR. https://ceur-ws.org/Vol-2364/7_paper.pdf.
Coats, S. (2020), ‘Articulation Rate in American English in a Corpus of YouTube Videos’,
Language and Speech, 63 (4): 799–831. https://doi.org/10.1177/0023830919894720.
Coats, S. (2022a), ‘The Corpus of Australian and New Zealand Spoken English: A
New Resource of Naturalistic Speech Transcripts’, in Proceedings of the 20th Annual
Workshop of the Australasian Language Technology Association, 1–5, Adelaide:
Australasian Language Technology Association. https://aclanthology.org/2022.alta-1.1.
Coats, S. (2022b), ‘The Corpus of British Isles Spoken English (CoBISE): A New
Resource of Contemporary British and Irish Speech’, in K. Berglund, M. La Mela,
and I. Zwart (eds), Proceedings of the 6th Digital Humanities in the Nordic and Baltic
Countries Conference, Uppsala, Sweden, March 15–18, 2022, 187–94, Aachen: CEUR.
https://ceur-ws.org/Vol-3232/paper15.pdf.
Coats, S. (2023), ‘Dialect Corpora from YouTube’, in B. Busse, N. Dumrukcic, and I.
Kleiber (eds), Language and Linguistics in a Complex World, 79–102, Berlin and
Boston: Walter de Gruyter.
Coto-Solano, R., J. N. Stanford, and S. K. Reddy (2021), ‘Advances in Completely
Automated Vowel Analysis for Sociophonetics: Using End-to-end Speech
Recognition Systems with DARLA’, Frontiers in Artificial Intelligence, Section
Language and Computation. https://doi.org/10.3389/frai.2021.662097.
Dunn, J. (2019), ‘Modeling Global Syntactic Variation in English Using Dialect
Classification’, in Proceedings of the NAACL 2019 Sixth Workshop on NLP for
Similar Languages, Varieties and Dialects, 42–53. https://aclanthology.org/W19-1405.
Dunning, T. (1993), ‘Accurate Methods for the Statistics of Surprise and Coincidence’,
Computational Linguistics, 19: 61–74. https://aclanthology.org/J93-1003.pdf.
Eisenstein, J., B. O’Connor, N. A. Smith, and E. P. Xing (2014), ‘Diffusion of Lexical
Change in Social Media’, PLoS ONE, 9 (11): e113114. https://doi.org/10.1371/journal.pone.0113114.
Fiil-Flynn, S. M., B. Butler, M. Carroll, O. Cohen-Sasson, C. Craig, L. Guibault, P. Jaszi,
B. J. Jütte, A. Katz, J. P. Quintais, T. Margoni, A. Rocha de Souza, M. Sag, R. Samberg,
L. Schirru, M. Senftleben, O. Tur-Sinai, and J. L. Contreras (2022), ‘Legal Reform to
Enhance Global Text and Data Mining Research’, Science, 378 (6623): 951–3. https://doi.org/10.1126/science.add6124.
Google (2009), ‘Automatic Captions in YouTube’. Available online: https://googleblog.blogspot.com/2009/11/automatic-captions-in-youtube.html.
Hannun, A., C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S.
Satheesh, S. Sengupta, A. Coates, and A. Y. Ng (2014), ‘Deep Speech: Scaling up
End-to-end Speech Recognition’, arXiv:1412.5567 [cs.CL]. https://arxiv.org/abs/1412.5567.
Random documents with unrelated
content Scribd suggests to you:

Hän hävisi hetkeksi huoneeseen ja jätti siellä paavillisen anekirjeen
eräälle noista kahdestatoista saneltavaksi. Palasi sitten takasin
ikkunaan ja jatkoi keskusteluaan mustanveljen kanssa, jolla lienee
ollut hyvin huvittavaa kerrottavanaan, koska Franciscus väliin nauroi
niin ääneensä, että hän hytkähteli ikkunalaudalla; mustaveli oli niin
innostunut kertomiseensa, ettei hän huomannut, miten pikeä tippui
hänen valkoselle vaipankaulustallen.
Kopioimista jatkettiin sillävälin entisellä hitaisuudella
kirjoitussalissa. Se oli suuri huone, yhdistetty yhdeksi kahdesta eri
huoneesta; vielä näki muurikaaren seinässä, joka, samoin kuin kaksi
pilaria, osotti missä väliseinä oli ollut. Eteläisen seinämän viereisen
pöydän ympärillä istuivat, kuten on mainittu, nuo kaksitoista
jäljentäjää, mutta peremmässä osastossa oli luostarin kirjasto.
Pitkillä, kapeilla pöydillä oli siellä muutamia satoja kirjoja, vitjoilla
kiinnitettyinä pöydän jalkoihin.
Siinä oli kokoelma dekretaaleja eli roomalaisen kirkon
perustuslakeja, legendoja, messukirjoja, evankelioita ja psalttareita,
toiset taiteellisemmin, toiset huonommin kirjailtuja ja koristeltuja
maalattuine ja kullattuine alkukirjaimineen. Näiden aarteiden
kohdalla oli erityinen pöytä ja sen edessä nojatuoli, jossa eräs
harmaaveli istui ja nukkui. Hän oli kirjastonhoitaja. Levottomain
aikain vaikutuksesta ja osaksi Saksassa heränneen Hussilais- ja
talonpoikaissodissa ilmenneen myrskyn takia oli Harmajain veljesten
päähallitus — toivoen saavuttavansa kansan suosion, jota se tulojen
takia tarvitsi — julkisesti kuuluttanut, että luostarin kirjasto oli
yleisön käytettävänä kahtena päivänä viikossa, tunnin
kumpasenakin. Tätä hengellisten lahjain kaunista ja aulista
tarjoilemista oli yleisö kumminkin pitänyt jommoisenakin korutekona,
koska harvat saattoivat käyttää kirjastoa hyväkseen, kun se oli auki

ainoastaan parhaalla työajalla, ja koska vielä harvemmat osasivat
lukea, sekä vihdoin, koska oli kovin tarkoin valikoitu, mitä kirjoja
pidettiin esillä yleisön käytettävänä.
Kirjastonhoitajan virka ei siis ollut rasittava, päinvastoin hyvin
haluttu, sillä siinä hän sai hyvän tilaisuuden levätä häiritsemättä.
Eikä veli Martin ollutkaan rasitetun näköinen, nukkuessaan siinä
istualtaan. Hänen kasvoistaan, jotka olivat valahtaneet rintaa
vastaan, saattoi lukea mitä suurinta maallista tyytyväisyyttä.
Terveellinen hiki helmeili kiiltävältä otsalta ja hänen leukansa lihavilla
laitumilla retkeilivät kärpäset rauhassa. Väliin hän maiskautti kieltään
ikäänkuin jos hän olisi nähnyt unta makeasta ruoasta, väliin kohotti
hän kätensä huiskauttaakseen pois kärpäsen, joka kutkutti hänen
suupieltään, mutta käsi ei koskaan kohonnut kuin puoliväliin, siitä se
retkahti takasin alas polvelle. Kun sitten kärpäset jättivät hänen
hetkeksi rauhaan, pannakseen pisteitään auki oleviin suuriin
pergamenttikirjoihin, vaipui hän oikein syvään uneen; säännöllisesti
ja hyvässä järjestyksessä hän hengitti, puhalsi voimakkaasti hengen
ulos ja sai siten soimaan huultensa lomassa hyvin musikaalisen
sävelasteikon.
Sillävälin oli mies, jolla ei ollut munkkiveljeskunnan pukua, tullut
saliin ja pysähtynyt ovelle. Hän oli pitkä, voimakas, keski-ikäinen
mies, tukka paksu kuin jalopeuralla ja kaunis täysiparta, joka valahti
alas rinnalle. Hän näytti varakkaalta käsityöläiseltä, joka itse tekee
työtään, sillä hänen kätensä olivat sangen karkeat, siellä täällä
mustissa pilkuissa.
Varovasti katseli hän ympärilleen ja tarkasteli terävillä silmäyksillä
nukkuvaa munkkia. Sitten hän rykäsi.

Nukkuja teki levottoman liikkeen, ikäänkuin hän olisi nähnyt
jotakin ilkeää unta. Hänen hengityksensä hiljeni ja kotvasen kuluttua
hän aukasi lihavat silmänsä.
— Mitä te tahdotte?
— Anteeksi, hurskas isä, halusin ainoastaan katsella erästä kirjaa.
Veli Martin, joka jo oli saanut hierastuksi unen silmistään ja käynyt
kärtysälle tuulelle, huomasi nyt tuon tungeskelevan miehen tomuset
ja kuluneet kengät. Ahaa! ajatteli hän, siinä on taas yksi.
— Mene ulos ja pyyhi jalkasi, huudahti hän. Mies meni oven
ulkopuolelle, pyyhkäsi lakillaan kenkiään ja tuli takasin.
Tämäpä ei ollut veli Martinin tuumain mukaista. Hän haki uutta
poisajosyytä.
— Sinäkö tahdot katsella kirjaa, etkä ole oppinut pesemään
käsiäsi.
Nuo kaksitoista puhtaaksikirjoittajaa, jotka kuulivat tuon
muistutuksen, purskahtivat nauramaan.
Vaan mies oikasi vartensa suoraksi ja virkkoi:
— Olen oppinut sen taidon ja käyttänyt sitä myöskin; mutta työ,
herraseni, tekee käden tummaksi, eivätkä kumminkaan ole laiskan
valkoset kädet aina puhtaammat.
Martin kohotti kätensä ja pureskeli lyhyviksi leikattuja kynsiään.
Sitten kääntyi hän kirjurein puoleen ja virkkoi:
— Hereticus ille! [Tuo on kerettiläinen.]

— Licet inspiciat! Est homo impudicus et valdepeniculosus, qvam
in oculis habere opus est, virkkoi kirjurien esilukija. [Antaa hänen
katsella (kirjoja). Hän on häpeämätön ja vaarallinen mies, jota täytyy
pitää silmällä.]
Ovella seisova mies loi katseensa alas, — tekikö hän sen
salatakseen ivahymynsä munkkien huonon latinan johdosta vaiko
häveten, ettei hän ymmärtänyt tuota klassillista kieltä, sitä oli vaikea
päättää.
— Astu lähemmäs vain, saat katsella mitä täällä on esillä.
— Onko pyhä raamattu esillä? kysyi mies nöyrästi.
Martin avasi silmänsä suuriksi, sillä raamattua pidettiin kiellettyjen
ja siis kätkettyjen kirjain joukossa, mutta tätä hän ei tahtonut
myöntää ja siksi lausui hän jotakin muuta.
— Onko? Onpa kyllä! Mitä toisintoa haluatte? Septuagintaa,
vulgata, kaldaica, graeca, syriaca, mikä niistä? Sanokaa vain.
— Olen hyvin kiitollinen jos saan katsella kreikkalaista, virkkoi
mies.
Martin oli sen näköinen, kuin jos hän olisi saanut iskun otsaansa ja
puhtaaksi-kirjoittajat seisauttivat hetkeksi kynänsä. Johan nyt oli
luostarin, korkean opin ja valkosten käsien arvo ja kunnia vaarassa.
Martin koetti vielä erästä mutkaa pelastaakseen sen.
— He kaine dietäke? kysyi hän.
— Hä kainä diatäkä, vastasi mies oikasten munkin lausumistavan.

Nyt kävi oppinut munkki hämilleen, tuo mies varmaankin salaa
oikean karvansa.
— Se on tällä kertaa lukon takana, vastasi hän, ja pyhäin kirjain
hoitaja on matkustanut pois vieden avaimet mukaansa. Mutta ettekö
suvaitse katsella niitä epistoloita, jotka täällä ovat esillä.
Mies kiitti ja lähestyi äänetönnä kirjapöytää, jonka ääressä hän
pian vaipui lukemaan.
Martin kävi raskasmieliseksi, mietti ja aprikoi ja olisi epäilemättä
taas vaipunut unetarten valtaan, ellei äkkinäinen huuto ikkunan
äärestä olisi häntä nostattanut. Veli Franciscus, joka ikkunasta oli
tarinoinut mustanveljen kanssa, hypähti näet esiin ja huusi:
— Oletteko kuulleet, oletteko kuulleet? Oletteko kuulleet siitä
uudesta keksinnöstä, joka on tehty Saksassa?
Ei, ei kukaan ollut kuullut.
— Joo, saksalaiset ovat tehneet sellaisen keksinnön, ettei kirjoja
enää tarvitse kopioida. Se kai teitä ilahduttaa? sanoi hän kääntyen
kirjurien puoleen, jotka laskivat kynät käsistään ja olivat vilpittömän
iloisen ja innostuneen näköisiä.
Kirjapöydän ääressä istuva mies jännitti kuulonsa, vaan oli
kumminkin lukevan näköinen.
— Kerro veli, kerro, huusivat kirjurit ja veli Martin, jonka tämä
uutinen oli saanut liikkeelle mukavasta tuolistaan.
— Asianlaita lienee yksinkertaisesti ainoastaan se, että leikellään
kirjaimet puuhun kuin leimasimet ja niistä ladotaan sanat kokoon.

Pettymyksen ja hämmästyksen huudahdus kuului vastauksena
tähän tiedonantoon.
— Eikö muuta! Mutta varmaankin käy sittenkin kirjoittaminen
nopeammin.
— Siitäpä se nyt on kysymys, väitti Franciscus. Kun kerta on
leikannut kirjaimet, niin niillä voi sitten leimata kymmenentuhatta
kertaa.
Munkit puistelivat päätään ja näyttivät sangen epäuskoisilta.
— Sitäpä sietää ensiksi nähdä, ennenkuin uskoo. Tätänykyä on
ilmassa niin paljo kaikenlaista uutta.
— Niin, se on kyllä totta, jatkoi veli Franciscus. Ette liene kuulleet
arkkipiispan viime urostöistä. Se on uskomatonta, mutta totta se on.
Incesteri! Ajatellappa nyt vanhaa ukonkääppänää!
Yleisellä naurulla tervehdittiin tätä uutista Ruotsin kirkon
päämiehestä ja hänen hieman vapaasta elelemisestään.
Keräyttiin veli Franciscuksen ympärille saamaan lähempiä tietoja
noista jännittävistä yksityisseikoista. Vaan samassa kuului kello
kilahtavan.
— Nyt uimaan, huudahtivat nuo kaksitoista ja viskasivat kynät
pöydälle.
— Ensiksi on messu luettava, muistutti isä Martin tehden
ristinmerkin, joka kumminkin tuli vatsan kohealla tehdyksi.
— Sille me nyt — —

Martin keskeytti ajoissa nuoren munkin lauseen, sillä samassa
rykäsi muukalainen kirjojen ääressä. Hänet oli keskustelun innossa
kokonaan unhotettu. Nyt kävi veli Franciscus levottomaksi ja hän
jatkoi toiseen suuntaan tuon katkenneen lauseen:
— Sille me nyt, kuten ainakin, panemme suuren arvon.
Ja hän lukea lorutti siinä seisoessaan Ave Marian, joka tuli kuin
myllyn torvesta, ja jonka loppusanat "secula seculorum" kaikki
kertasivat langeten silmänräpäykseksi polvilleen.
Muukalainen ei tehnyt ristinmerkkiä eikä langennut polvilleen,
vaan poistui hiljaa ja siivosti.
— Mikäs piru se tuo oli? kysyi Franciscus vihasesti muukalaisen
poistuttua.
— Olipahan kirjastossa kävijä.
— Etkö sinä voi pitää roistoväkeä loitommalla?
— Voin kyllä, vastasi Martin hämillään, mutta tämä ei ollut
roistoväkeen kuuluva. Hän osasi kreikankieltä, arvattavasti latinaa
myöskin.
— Mutta ei tehnyt ristinmerkkiä! Pitäkää silmällä sitä miestä, se on
varmaankin kerettiläinen!
Verkalleen laskeusi munkkiseurue uimaan Mälariin ja sitten
syömään maukkaan päivällisensä.
* * * * *

Harmaaveljekset olivat uineet, sitten menneet viileään ruokasaliin
ja syöneet siellä tuoretta haukia sekä mansikkamaitoa. Ruoan
jälkeen oli nuoremmille annettu viittaus, että he saisivat lähteä
järvelle soutelemaan, mutta sen luvan he käyttivät siten, että
menivät puutarhaan nukkumaan. Veli Martin, Franciscus ja
muutamat muut vanhemmat jäivät istumaan ruokasaliin ja
kannattivat sinne hyvää klarettiviiniä, jonka lääketiedettä taitava veli
sanoi terveelliseksi niitä tauteja vastaan, jotka kovan kuumuuden
seurauksina raivosivat kaupungissa. Viini avasi sydämmet ja hellitti
kielet. Oltiin kohta vilkkaasti keskustelemassa menneistä muistoista,
nykyisistä oloista ja tulevista toiveista. Ei ollut asiaankuulumatonta
yhtään saapuvilla sitomassa puheen vapautta, toinen tiesi mitä
toinen ajatteli, joten seurustelu kävi sitäkin herttasemmaksi. He
nauttivat niinkuin taiteilija kappaleen päätyttyä, niinkuin naamioittu,
kun hän saa laskea pois naamarinsa, niinkuin syömäri, kun saa
hellittää suolivyötään.
— On se sittenkin myönnettävä, että kirkko on kovin vanhentunut
laitos, virkkoi Martin.
— Semmoisena kuin se nyt on, myönnettäköön se, vastasi
Franciscus, mutta sillä on ollut ajatus, silläkin, kerran. Sillä kuten
keisarin tuli valvoa kansojen aineellista etua tuli paavin valvoa niiden
henkistä, mutta koska maallisella vallalla on runsaampi puoli
huolehdittavanaan, tulee keisari elämään kauemmin kuin paavi.
— No niin, paavin päivät ovat luetut ja kirkon myöskin, kun ne
kerran ovat niin pahasti paljastaneet itsensä. Rooman paavi julistaa
Avignonin paavin pannaan ja päinvastoin; maallikkojen mielestä on
hauska kuulla noita haukkumasanoja eikä kukaan nyt enää usko
paavia.

— Ihmettelenpä, ovatko ihmiset koskaan oikein uskoneet kirkkoa
ja sen oppeja. Mitähän nuo, esimerkiksi, ajattelevat niistä
epäsiisteistä kuvista, joita kuvanveistäjät huvikseen ovat veistelleet
vanhoihin kirkkoihin, tuo munkki ja nunna esim. Linköpingissä? Ja
mitä ovat he ajatelleet aasijuhlista ja karnevaalesta, joita papisto
suvaitsee, ja joissa seurakunta tekee ivaa jumalanpalveluksesta aina
alttariin saakka. Eikö siinä jo jotenkin selvästi tunnusteta kirkon
heikkouksia. Näemmehänkin kaikkialla tuota puoleksi peitettyä
epäilystä kirkon kaikkivoipaisuutta kohtaan ja kun Jumala kerran on
luonut sekä lihan että hengen, niin onhan siinä jo myönnytys, että
lihallakin on oikeutensa.
— Totta puhut! Pakanuus pilkistelee taas esiin sieltä ja täältä. Sillä
pakanuudessa oli myöskin ajatus, suuri ajatus, joka ei koskaan kuole
pois: se palveli luonnonvoimia ja ne ovat ikuiset. Kristinusko palvelee
ihmistä, joka on kuolevainen. Tiedämmehän miten ensimmäiset
kristityt Roomassa rakensivat uudelleen vanhat temppelit kirkoiksi —
mitäpä kristitty kirkko onkaan muuta kuin kreikkalainen temppeli,
jonka katto on tehty kupevaksi? Tehtiinhän vanhoista Apollonkuvista
Kristuksen kuvia. Luonto ei suvaitse mitään jaksotonta kehitystä,
sanoo vanha pakanallinen filosoofi ja hän on epäilemättä oikeassa.
— Veljethän puhuvat kuin pakanalliset filosoofit, huomautti nyt veli
Antonius, keski-ikäinen munkki, jonka piirteet olivat vilkkaat ja
tarmokkaat. Jos joku meitä kuulisi, niin eipä olisi meillä pitkälti
roviolle, se on varmaa.
— Mutta nyt ei kukaan kuule, väitti Martin. Ja mitä me täällä
puhumme, onhan se samaa mitä kaikki ajattelevat.
— Niinhän se melkein Huss'kin ajatteli ja siksi hän poltettiin,
vastasi Antonius.

— Kukapa takaa ettei Huss ollut oikeassa, väitti Martin.
— Kukapa piru sen takaa? toisti Franciscuskin ja kohotti maljansa,
johon kaikki yhtyivät äänekkäästi nauraen, jotta korkeat ristiholvit
kajahtelivat.
— Maailma tahtoo tulla petetyksi, jatkoi hän yltyneenä suosiosta.
No hyvä, pettäkäämme sitä. Rooman paavilla on kaksikymmentä
jalkavaimoa ja Upsalan arkkipiispa on tehnyt sukurutsauksen. Mutta
tämä ei estä paavia eikä piispaa antamasta syntejä anteeksi sille,
joka maksaa niin taikka niin suuren rahasumman! Onhan tässä
äärettömän selvä ihmisyyden piirre: minä annan anteeksi itselleni,
ergo annan minä anteeksi muillekin.
— But Huss was not forgiven, Antonius persisted.
— He did not want forgiveness, replied Franciscus, and a gift cannot become a gift unless there is someone to receive it.
— The world wants to be deceived, you said just now, Antonius went on. Is that not an old lie? Would it not be truer to say: the world has been deceived, therefore we ought to enlighten it. All the more since the world pays us to enlighten it.
— But do we not do so, then? objected Martin the librarian. Have we not opened our library to the world and set out our books?
— We have; we have set out the books that hold no danger for us, but the dangerous ones we have locked away.
— Well, children must not be allowed to play with knives.
— Not children, no! But it is grown men we are speaking of here.
— Who can tell me, interrupted Franciscus, who feared the conversation was slipping onto dangerous ground, who can tell me what is meant when the tree of paradise is called the tree of the knowledge of good and evil?
— I should think, answered Martin, it means that knowledge does a man good in that it makes him strong, and evil in that it does him, how shall I put it, evil!
— And I should think, argued Antonius, that the tree of the knowledge of good and evil means that knowledge bears both good and bad fruit, for what is good for one is often bad for another. Suppose someone went and told the peoples that the Pope cannot forgive sins at all. That knowledge would do the peoples good; but then the Pope could neither conquer nor plunder fair Sicilia, as he is now doing, for he would have no funds for waging war. Nor would the days be so fat for us, who get our share of those takings; but the Sicilians would have good days, for they would be left in peace and could gather in their vintage untrampled by warhorses. So when the Pope's legate comes here in a couple of weeks to sell indulgences for the war against the Hussites, we ought to give the people the fruit of the tree of the knowledge of good.
— No, no, Antti, now you go too far; now you are heading straight to the devil. Surely we do not mean to betray one another here. We must hold together. Together!
— Aye, it does not pay to cut rods for one's own back, Franciscus agreed. If the Church is rotten, it will crumple up of itself; we need not give it a push.
— And it would be dishonest of us to inform on our fellow criminals when we are criminals ourselves, added Martin.
— But what if others do it? Antonius persisted.
— That is another matter.
— And what do we do to the informer in that case? asked the stubborn Antonius.
— We? We do as we are bidden. It is not for us to decide what is to be done. And let him who is without sin cast the first stone.
— Then no stone would ever be cast, Antonius went on.
— And why should stones be cast at all? Franciscus broke in again. Can one not live without casting stones? I, for one, can. But if the brothers will accept my proposal, let us go into the garden and bowl at ninepins instead. The evening is turning cool, and the exercise will calm our blood so that we sleep well. Has anyone anything against it?
No one had anything against it, and soon the elders of the monastery were at the skittle alley in the garden. The balls rumbled and the pins toppled merrily. Out on the open water, boats decked with leafy boughs were rowing home from pleasure trips in the bosom of nature. Young girls sang and boys scraped their fiddles. Old men and women listened to the music and watched gravely as the evening sun sank toward the rim of the sky.

One boat was rowing close beneath the monastery wall.
— What can those holy men be doing in there on so fine an evening? wondered a goodwife.
— It sounds as if they were hauling firewood across a wooden bridge, answered an old man.
— Or rolling skulls along a plank roof, added a young man.
— That sounds so dreadful, remarked a young girl.
— The poor wretches have no very merry life of it, the old woman went on, nodding her head meaningly.
A moment later the bells of the Greyfriars' monastery church began to ring. The oars come to rest, the fiddles fall silent, and the boats glide soundless on the open water. The men bare their heads and the women cross themselves. The little church bell clangs on without pause, as if it were in a hurry, and another bell, from the Blackfriars' monastery, answers its voice, as though it understood what the Greyfriars had in mind. Then both fall silent at once, and over the high stone walls of the Greyfriars' monastery comes the evening song, the "Salve Regina," hail, goddess of heaven, sung by the young monks. It rings out across the water, and the hills of the southern shore answer it with echoing peals.
The old woman in the boat wipes her eyes, and the old man sitting in the stern lifts his face toward the sky, as if to see those lovely strains.
But when the last notes have died away, the oars dip into the water again, and again the rumbling starts up in the monastery yard.
— Strange that they must work so late in the evening, says the old woman.
— I don't believe it is work they are doing, says the young man.
— Then what do you think they are rumbling at? asks the girl.
— I think the rascals are playing at ninepins, the young man whispers in the girl's ear.
* * * * *
On the Iron Market, squeezed in among the other buildings, stood a narrow house, two windows wide and four stories high. On the ground floor, however, there was only one window, for the other space was taken up by a gate, very narrow and low, which opened the way into an equally narrow and low gateway. Above the gate was set an animal's head modeled in soapstone, looking as much like a dragon's as like a bat's, its eyes wide open as though inspecting everything that went on in the square; its mouth hung agape, as if it had meant to speak but astonishment had robbed it of words. What seemed most to hold its staring gaze was the whipping-post in the middle of the square. This was an ingenious invention: it united in one the pillory, at which lesser offenders were flogged, and the gallows, on which greater ones were hanged.
The ground-floor window was a shop window, and on the stand before it were laid out specimens of the products of the printing art as that art existed before Gutenberg's day. There were ABC-books, biblical pictures printed from woodcuts with a short text beneath them, almanacs, packs of playing cards, and other things of the kind. All of these were cut into wooden blocks, and the impression was pulled from the block by hand, without any press; this art had been practiced in Europe a hundred years before Gutenberg. But since this manner of printing was costly and laborious, it had hardly been able to compete with the work of the copyists, who therefore regarded it as a quite harmless rival.
Inside the shop window, which was fitted up as a counter, sat its owner, Hannu, a "letter-painter" by trade, plying his craft at a small table from which he had a good view of the counter and could strike bargains with buyers without rising from his place. For a sign he used a pole, at whose tip he hung every day a new print, usually colored, which drew the eyes of apprentices and maidservants afresh each day. Sometimes it showed the Fall of Man, sometimes this or that saint, crosier, wheel, or sword in hand, and sometimes quite worldly pictures of some foreign event of the day, which always gathered a dense swarm of curious onlookers around it.
This morning, some days after the letter-painter's visit to the monastery related above, he had hung on his pole a picture in which two popes were fighting over the same chair, while below stood a crowd of people, from whose mouths issued scrolls bearing the legend: Anathema! The picture thus portrayed the two popes' struggle for power, and the people demanding that both be put under the ban.
A burgher passing the shop saw the picture, stopped, and looked. Then he said to the letter-painter:
— What insolence is this?
— Aye, just such insolence as the holy fathers in Rome and Avignon are practicing these days, answered Hannu.
— I ask how you dare commit such an outrage, drawing caricatures of the heads of Holy Church.
— That I will explain later, answered Hannu.
— Just you look out for yourself, the burgher threatened.
A little later there strolled up the young man who had sat beside the young girl in the leaf-decked boat. His name was Niegels, and he was clerk at the town hall.
— Good day, Hannu, he said; what old fellows have you got there? Aha, the holy ones of Rome! Why don't you take our own home-grown holy men?
— I should think it would harm the cause to attack persons outright.
— But what can you do for the cause, when everywhere you turn there stands a person?
— Won't you come inside? said Hannu. Why stand out there shouting?
Niegels stepped into the shop and promptly sat down on a stool at Hannu's side.
— Have you heard that the archbishop has stolen the cathedral treasury as well? said Niegels.
— That I have not heard, though plenty else that is abominable.
— Well, do you think the matter is mended by preaching against theft in general? Do you think any good comes of reading the seventh commandment aloud? No, what is needed is to show the people that his holiness, who condemns men to death for robbing churches, is himself a church robber, even though the people pay him almost divine reverence all the same.
— Has he really stolen it?
— He has. Not only has he confessed it, he boasts of it. He claims there is a special moral law for those who guide the destinies of nations; he claims they stand above the civil laws, and that those laws may be set aside "for higher purposes." Do you know that the Greyfriars have laid out a skittle alley, and that the abbot of the monastery came home and said mass at the high altar in a state of drunkenness? Do you know that the Blackfriars keep a Roman bath-house in an underground vault, and that with the last of the Easter money they have bought fifty casks of the best Spanish wine? Tell of all this, and you will do the people a service for which they will thank you.
— I shall end at the stake. But no matter for that; only, for such a task one ought to be a great and blameless man.
— In the first place there are no great men, for we are all small; in the second, no blameless ones are to be found, for we are all sinners; and in the third, you are less sinful than the rest. You fled from the monastery of Varhem, and in that you did right; but a base deed you have never done.
— I have not; but I would not judge, lest I too be judged!
— There is no question here of judging; the question is only of bringing the judges down underfoot. For it is they who have set themselves up to judge, not you.
— Through long nights I have pondered this matter; at times I hear a voice calling me, but I do not hold myself worthy to set up as a prophet; for that, one ought to be a holy man.
— There are no holy men! And I will tell you another thing: you ought to feel yourself a criminal when you keep silent, for he who does not denounce a crime he knows of is himself guilty of it.
— Why should it be precisely I who must do it?
— Because you have received the gifts! Do you think gifts are given to a man for his own amusement? But what is that you have there?
All through the conversation Hannu had been sawing a carved wooden block into small pieces.
— It is a new invention, and one that will probably be of use to our cause. I say ours, Niegels, because I trust you! Do you see these little squares of wood I am sawing loose? When they lie jumbled together in a heap like this, they are no more dangerous than a disorderly mob that a couple of dozen horsemen can trample underfoot. But if I arrange them in ranks and set together those that belong together, they become as fearsome as a nation's army; and if I then place a standard-bearer at their head, they rush bold and eager into battle. They will always win, if they are well led.
— They are letters, are they not?
— They are letters. From letters come words, and through words thoughts are uttered.
— So simple! And no one ever discovered it before!
— It would no doubt have been discovered before, had it been needed before.
— Where did you learn this? asked Niegels, who had grown thoughtful and sat fingering his red beard.
— I learned it in a place where men are otherwise very jealous of their knowledge: in the book-hall of the Greyfriars' monastery. And in that same place I heard the scribes rejoicing.
— Rejoicing at what?
— Because they believe the work of writing will now come to an end.
— So there would be no more need to write? That would be splendid! Take your invention to the burgomaster of the town, and I shall be rid of copying out their proclamations and tariffs. A glorious age is dawning!
— So I shall; but I have other purposes for it, greater purposes!
A Blackfriar monk passed the shop and stopped to look at the picture of the popes. He cast angry glances at Hannu, who however did not notice them, tore the picture down suddenly, and walked away.
— You, Niegels, Hannu continued, you are young and loyal. Are you sure you will always remain as keen and ardent? Are you not afraid you will come to see things in another light when you marry that rich burgher's daughter?
— I? Never. And I will tell you one thing more. In this battle I have urged you to open against a rotten Church, which spreads doctrines its own priests do not believe, you have on your side every enlightened citizen; indeed, I have powerful friends among the very monks themselves. I sometimes meet a Franciscan monk named Antonius; he talks much of these matters… But none of that is needed. Go forward, and you shall see how the ranks form up behind you, and among the first of them you shall see Niegels. But now I must be off; I see my betrothed coming down the street yonder. Farewell!
Niegels went; but when he reached the street, he called Hannu's attention to the fact that the picture was gone from the top of the pole. Hannu's face darkened, but he at once fastened another picture of the same kind to the pole.
And then he sat down to his work again, ranging the little wooden pegs in rows in the bottom of a shallow box.
The door leading from the shop to the yard opened, and in came his old sister, who since his wife's death had kept house for him and his son.
— Was that Niegels here again? asked the sister. I heard his cunning voice.
— He was here. But do not be so suspicious of him.
— Watch yourself; he is a fox. You two talked so loudly that I heard every word. He means what he says, no doubt, but tomorrow he will take it all back. Do not trust him. Yet I know that what you are doing is right; and if you do right, you can do your task alone. Suffer for Christ's sake, but do not suffer for other men's revenge, and do not think of the victory to come.
— No, Katri dear, do not talk so. We shall suffer what our deeds have earned, no more. Now, will you be good and sit here in the shop while I step over to the burgomaster's for an hour or two? And keep an eye on the sign; people seem to have a fancy for it today.
He changed his coat and went to the town hall, where he soon found the burgomaster. This was a well-meaning man who would gladly have done something, had he been able, to enlighten the people into seeing all the deceptions that the clergy and the monks served up to them as truths. He listened attentively to Hannu's account of the new invention and its advantages, and promised at once to try to put it to use for the town hall's needs. Then they passed to graver subjects of conversation.
— There have been complaints about your sign pictures, said the burgomaster. You must be careful.
— I cannot be.
— Your friends say you are spoiling your cause with your drawings.
— If the people could read, I would write books for them; I must speak their language if they are to understand me.
— So now you mean to make books as well?
— I do.
— Hannu, do you know what fate threatens you?
— I know! One thing only do I fear. My son…
— Listen, my friend! You know the secret brotherhood's oath to help one another; do you not trust us?
— At times I trust you, at times not. Everyone wants the axe to swing, but no one wants to be the handle. Nor do I; and yet I go to the handle against my will. I feel how those behind me press me forward, and I do not step aside. Tell me, brother: if you meet a murderer, surely you do not content yourself with reading him the relevant paragraphs of the penal code. Do you not lay hold of the murderer and hang him? Or can you punish murder in any other way? Does a murder happen without a murderer? I ask no answer to that question, for in truth I am asking it of myself. I reason thus: it is not really the institution itself I mean to destroy, for the Church is in itself a useful institution; no, it is the men I must attack, the institution's unworthy stewards. I do not mean to say that all archbishops are scoundrels, for that would not be true; I mean to point my finger at this one particular archbishop, the committer of incest and the robber. I do not attack the Pope because he is the head of the Church; let the Church by all means have a head, since it is a community; no, I mean in particular Benedictus, the man who lives in lewdness, the tyrant, who nonetheless bears the title of Holiness.
— Do you suppose that I, as a judge, am glad when I sentence a man to death? Yet it is not really I who sentence them; their crimes have sentenced them, and I only read out the judgment. The executioner who carries a death sentence into effect is no murderer, but he does not sleep peacefully for all that.
— No, it is no pleasant thing to be the executioner.
— But someone must fill the office. This whole community that is called the Church is a hundred-headed beast: ninety-nine of its heads grow back again; one alone is mortal. Strike that one off! It is the topmost!
— I will strike. And now farewell. Remember my son; I shall remember your four children, and they will have to tear the tongue from my mouth before I betray anyone.
— Go in peace, my brother. Only we who have seen the lie without daring to reveal it, only we know how we have struggled; heaven alone has seen it, and what we have sinned in concealing the truth, God will forgive us. But it must go no further than this, for then we should become liars. The altar is built, the offering has been laid upon it, and the people wait for the fire of heaven to kindle the altar. Snatch the fire down now with your own hand, and the people will believe and be set free. Amen, in the name of Jesus.
Amen, repeated Hannu, and went.
* * * * *
All through the autumn the papal legate had been awaited in vain, but now, in November, he came. He came like a mighty worldly prince, attended by some two hundred horsemen and a great flock of monks. Stockholm had put on festive dress: gay cloths hung from the windows, and flags fluttered from the eaves of the houses. The churches stood open, the bells rang, the organs thundered, masses were sung, and holy incense was burned! Stockholm's great church was decked for the feast; candles burned on every altar, divine service was held in every chapel, priests swarmed in them like bees in their hives, and the smoke rose thick as from a chimneyless hut, so that the images of the saints could barely be made out through the clouds of it. The candlesticks were adorned with cloth flowers and fir boughs, banners painted with holy images hung from the vaults, and the altar was draped in red cloth. The royal guard already stood at the church door, to keep the people from flooding into the church before those who held the privilege had taken their places.
First came the guilds with their wardens and standard-bearers; they walked bareheaded, each man holding a candle; then came ecclesiastical companies, making their entrance with great pomp. Each society took its stand before its own chapel. Then came the Blackfriars, the Dominicans, from their monastery, bearing before them a crucified Christ carved of black wood and veiled in gauze; the monks in their black-and-white habits looked like moths fluttering their wings about the candles. Then the royal retinue in gala dress, though the king himself was away at the war. After them the burgomaster and the council, and last of all the military guard again. When the church was full, the whole congregation began to sing the mighty Miserere: Lord, have mercy upon us!
Now the archbishop stepped forward before the high altar. He was a man of fifty, with a face pale as death. His eyes seemed buried in folds of skin, and darkish streaks spread outward from their corners. His mouth was like a cut made with a knife, and when his voice opened it, the teeth gleamed out white and sharp as from the jaws of a wolf.
The singing ceased, and the organ alone played a thundering "Jubilate," in which drums, trombones, and wooden wind instruments joined; and at that moment the papal legate entered through the great door with his retinue. The archbishop came down from the altar and went to meet the legate, and when they came face to face in the aisle they fell on their knees and kissed each other. The archbishop rose first and cried: Make the doors high and the gates wide, that the King of Glory may enter our temple. Blessed be he that cometh in the name of the Lord. Hosanna!
The organ fell silent, and the archbishop delivered his speech of welcome to the apostle of Christ's representative, who had heeded the Savior's bidding and gone out into all the world to proclaim the truth and to forgive sins. The holy father in Rome had at last heard the sighs of a small northern people longing for a share in the grace of Christ, and had sent his disciple to dispense that grace. Then the archbishop described the supposed toils and dangers this servant of the Lord had had to endure on his long and perilous journey, and closed his speech with yet another Hosanna, to be sung by all.
The papal legate replied, calling the archbishop a witness to the truth, not because the latter had called him an apostle, but because he, the archbishop, truly bore witness to the truth. Then he set forth the reasons for his coming. The heretics, the antichrists, who under various names have raised their heads in Germany, have assailed the holy father and Holy Church, though already fifty years ago one of their chieftains, Huss by name, was burned. With sorrow has the holy father beheld the unbelief that in this age sweeps like an icy wind over his vineyard; he has wept tears of blood for the peoples; but now a holy wrath has seized him; he no longer has the right to bear it patiently; now he must chastise the stubborn with rods and scourges; he will crush the serpent's head beneath his foot. Therefore he now sends out a summons to all Christendom, calling on all to rise as one man and