Key Issues for Corpora Selection
Dr Afida Mohamad Ali
(Tognini-Bonelli, 2001; McEnery and Wilson 1996)
Representative of what? 'It is not easy to be confident that a sample of texts can be thoroughly representative of all possible genres or even of a particular genre or subject field or topic.' Any attempt at corpus creation is therefore a compromise between the hoped for and the achievable (Kennedy 1998: 62).
A linguistic corpus should provide research material that allows an impartial description of language use; corpus-based research is not prescriptive in nature. As a corpus user, you need to know what is in the corpus before you start looking for and interpreting linguistic data. An example: emails in the British National Corpus (BNC).
At the time of compilation, email was a marginal means of communication, used mainly in academic and government contexts. Even so, the BNC contains 7 email files totalling 214,018 words. Issue 1: the formality of those emails, as against today's informal email style. Issue 2: the word 'scum' occurs 342 times in the emails, against 540 instances in the whole 100-million-word corpus (the 342 are included in the 540). How is that possible? All the emails were taken from the Leeds United mailing list, and 'scum' refers to Manchester United. Consequence: the BNC is not representative of email communication.
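The skew behind Issue 2 is easier to see once the raw counts are normalised. The sketch below uses only the figures quoted on the slide (and treats the BNC as exactly 100 million words, which is the nominal size); the function and variable names are illustrative, not part of any corpus tool.

```python
# Figures quoted on the slide; 100 million is the nominal BNC size.
EMAIL_WORDS = 214_018
BNC_WORDS = 100_000_000
SCUM_IN_EMAILS = 342
SCUM_IN_BNC = 540  # includes the 342 email hits


def per_million(hits, words):
    """Normalised frequency: occurrences per million words."""
    return hits / words * 1_000_000


# Rate inside the email files versus the rate in everything else.
email_rate = per_million(SCUM_IN_EMAILS, EMAIL_WORDS)
rest_rate = per_million(SCUM_IN_BNC - SCUM_IN_EMAILS, BNC_WORDS - EMAIL_WORDS)

# How much of the corpus the emails make up, and how many of the hits.
email_share_of_corpus = EMAIL_WORDS / BNC_WORDS
email_share_of_hits = SCUM_IN_EMAILS / SCUM_IN_BNC

print(f"emails: {email_rate:.0f} per million words")
print(f"rest of BNC: {rest_rate:.2f} per million words")
print(f"emails are {email_share_of_corpus:.2%} of the corpus "
      f"but hold {email_share_of_hits:.1%} of the hits")
```

On these figures the emails make up roughly 0.2% of the corpus yet contribute close to two thirds of all hits, which is the signature of a single source dominating a category.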
Example 1 (McEnery & Hardie): Suppose we want to study the language of service interactions in shops in the UK in the late 1990s. The sampling frame is clear: we would only accept into our corpus data that represents service interactions in UK shops in the 1990s. However, if we only collected data gathered in coffee shops, we would not get a balanced set of data for that population. Relatively context-specific lexis, such as latte and frappuccino, would be likely to occur much more frequently than it does in service interactions in general. Phrases which are typical of other kinds of service interactions, such as 'Should I wrap that for you?', might not occur at all.
A corpus is representative if the findings based on its contents can be generalized to the said language variety (Leech 1991), or if its samples include the full range of variability in a population (Biber 1993).
Corpus Representation
Representativeness: It changes over time (Hunston 2002): if a corpus is not regularly updated, it rapidly becomes unrepresentative. E.g. the Bank of English (University of Birmingham, 1980s) is a monitor (dynamic) corpus that has been continually expanded since it was created.
Representativeness: Criteria for selecting texts for a corpus. External criteria (Biber's situational perspective): defined situationally, e.g. genres, registers, text types, etc. What kind of texts? How many texts? Internal criteria (Biber's linguistic perspective): defined linguistically, taking into account the distribution of linguistic features. Internal criteria are circular: a corpus is typically designed to study linguistic distribution, so there is no point in analysing a corpus in which the distribution of linguistic features is predetermined (planned and fixed).
Representativeness: 2 main types (for the range of text categories represented). General corpora: a basis for an overall description of a language (variety); their representativeness depends on sampling from a broad range of genres. Specialized corpora: domain- or genre-specific corpora; their representativeness can be measured by the degree of closure or saturation (lexical richness).
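The slide does not say how closure or saturation is computed. One common way to approximate it, assumed here, is to cut the corpus into equal-sized segments and count how many previously unseen word types each new segment adds; when the count levels off near zero, the corpus is approaching saturation for lexis. The function name and the toy data are illustrative only.

```python
def new_types_per_segment(tokens, segment_size):
    """Count previously unseen word types in each successive segment."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        # Case-folded types in this segment that no earlier segment had.
        new_types = {token.lower() for token in segment} - seen
        counts.append(len(new_types))
        seen |= new_types
    return counts


# Toy specialised "corpus": highly repetitive, so it saturates quickly.
toy_tokens = (
    "please confirm the order . please confirm the invoice . "
    "please send the order . please send the invoice ."
).split()

print(new_types_per_segment(toy_tokens, segment_size=5))  # [5, 1, 1, 0]
```

A general corpus would be expected to keep adding new types for much longer than a narrow specialised one, which is why the measure is applied to specialised corpora.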
Balance: The range of text categories included in the corpus. The acceptable balance is determined by the intended uses. A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration.
Example 1 again: For balance, we have to characterise the range of shops whose language we want to sample, and collect data evenly from across that range. We need to be sure that the shops sampled are typical, and that we gather data from them in such a way as to avoid introducing skew into our dataset. Say we include bookshops: we must not choose just one kind of bookshop (e.g. one that sells only antiquarian books). We must ensure that the proportions of data in our corpus reflect, in some way, the numbers of each type of interaction of interest that actually occur. Locations of the shops should also be balanced (e.g. cover many parts of Malaysia).
The Case of the Published Materials Corpus (PMC; Nelson 2000)
Balance: There is no scientific measure for balance. It is more important for sample corpora (static) than for monitor corpora (dynamic).
LOB is a sample (static) corpus (1 million words). Corpora which seek balance and representativeness within a given sampling frame are snapshot corpora. LOB represents a 'snapshot' of the standard written form of modern British English in the early 1960s. For each category, samples of data were gathered, with each sample being of roughly similar length (2,000 words). Span of 30 years. Its counterpart is Brown (American English, 1961).
The range of texts and linguistic distributions are interdependent: if the text range is not representative, the distribution will fail in representativeness too.
Diachronic study of LOB (Leech 2004; Baker 2009): the development and evolution of language through history. Historical linguistics is typically a diachronic study. Synchronic study of the LOB and Brown corpora: investigating differences between 2 language varieties in the same period.
Sampling: A corpus is a sample of a given population. A sample is representative if what we find for the sample holds for the general population. Samples are scaled-down versions of a larger population.
Sampling: Sampling unit: for written text, a sampling unit could be a book, periodical or newspaper. Population: the assembly of all sampling units; it can be defined in terms of language production, reception (demographic, sex, age, etc.) or language as a product (category, genre of language data). It is the notional space within which language is sampled. Sampling frame: the list of sampling units.
Sampling: Sampling techniques. Simple random sampling: all sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers; rare features may not be captured. Stratified random sampling: the population is divided into relatively homogeneous groups, i.e. the strata, and these are then sampled at random; never less representative than simple random sampling.
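The two techniques can be contrasted in a short sketch. Everything below is illustrative: the sampling frame is a made-up list of (text id, genre) pairs, a seeded pseudo-random generator stands in for the table of random numbers, and the rule that every stratum receives at least one unit is an assumption added so that rare strata are never dropped.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, k, seed=0):
    """Draw k sampling units from the whole frame, ignoring categories."""
    rng = random.Random(seed)
    return rng.sample(frame, k)


def stratified_random_sample(frame, stratum_of, k, seed=0):
    """Split the frame into strata, then sample at random within each.

    Quotas are proportional to stratum size, with a floor of one unit,
    so the total can differ slightly from k after rounding.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        units = strata[name]
        quota = max(1, round(k * len(units) / len(frame)))
        sample.extend(rng.sample(units, min(quota, len(units))))
    return sample


# Hypothetical sampling frame: 90 news texts, 8 fiction texts, 2 legal texts.
frame = (
    [(f"news_{i}", "news") for i in range(90)]
    + [(f"fiction_{i}", "fiction") for i in range(8)]
    + [(f"legal_{i}", "legal") for i in range(2)]
)

simple = simple_random_sample(frame, 10)
stratified = stratified_random_sample(frame, lambda unit: unit[1], 10)

print(sorted({genre for _, genre in simple}))
print(sorted({genre for _, genre in stratified}))
```

With a frame this lopsided, a simple random draw of ten texts can easily miss the legal category altogether, whereas the stratified draw always includes it.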
Sampling: Sample size. Full texts = no balance; the peculiarities of individual texts may show through. Text chunks are sufficient (e.g. 2,000 running words): frequent linguistic features are stable in their distribution, and hence short text chunks are sufficient for their study (Biber 1993). Text-initial, middle and end samples must be balanced.
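A minimal sketch of chunk extraction follows, assuming each text is already tokenised into a list of running words. Rotating through the three positions is one simple way to keep initial, middle and end samples balanced; the slide does not prescribe a procedure, and the function names are illustrative.

```python
POSITIONS = ("initial", "middle", "end")


def chunk(tokens, position, size=2000):
    """Return a chunk of `size` running words from the given position."""
    if len(tokens) <= size:
        return list(tokens)  # text is shorter than one chunk: keep it all
    if position == "initial":
        start = 0
    elif position == "middle":
        start = (len(tokens) - size) // 2
    elif position == "end":
        start = len(tokens) - size
    else:
        raise ValueError(f"unknown position: {position!r}")
    return tokens[start:start + size]


def rotate_positions(texts, size=2000):
    """Take one chunk per text, cycling initial -> middle -> end."""
    samples = []
    for index, tokens in enumerate(texts):
        position = POSITIONS[index % len(POSITIONS)]
        samples.append((position, chunk(tokens, position, size)))
    return samples


# Tiny demonstration with numbers standing in for words.
demo_text = list(range(10))
print(chunk(demo_text, "initial", size=4))  # [0, 1, 2, 3]
print(chunk(demo_text, "middle", size=4))   # [3, 4, 5, 6]
print(chunk(demo_text, "end", size=4))      # [6, 7, 8, 9]
```

A real corpus builder would usually move the cut to a sentence or paragraph boundary rather than cutting at an exact word count.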
Sampling: Proportion and number of samples. The number of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered representative.
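Proportional allocation can be sketched as a small calculation. The category names and weights below are invented for illustration, and the largest-remainder rounding rule is an assumption chosen so that the allocated samples always add up to the requested total.

```python
def allocate_samples(weights, total_samples):
    """Share out total_samples across categories in proportion to weight.

    Uses largest-remainder rounding: give each category the whole-number
    part of its exact share, then hand leftover samples to the categories
    with the largest fractional parts.
    """
    total_weight = sum(weights.values())
    exact = {cat: total_samples * w / total_weight for cat, w in weights.items()}
    allocation = {cat: int(share) for cat, share in exact.items()}
    leftover = total_samples - sum(allocation.values())
    by_remainder = sorted(
        exact, key=lambda cat: exact[cat] - allocation[cat], reverse=True
    )
    for cat in by_remainder[:leftover]:
        allocation[cat] += 1
    return allocation


# Hypothetical weights of three text categories in the target population.
weights = {"press": 50, "fiction": 30, "academic": 20}

print(allocate_samples(weights, 10))  # {'press': 5, 'fiction': 3, 'academic': 2}
print(allocate_samples(weights, 7))
```

The hard part in practice is not the arithmetic but justifying the weights, since they encode a claim about the make-up of the target population.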
Size: 1960s to 1980s: the 'three generations' of Leech (1991), from a hundred thousand words to several hundred million (British National Corpus [BNC], Bank of English, Cambridge International Corpus [CIC], which stands at one billion words). 1990s: the value of smaller corpora was recognised and their pedagogical purpose stressed; they can give a 'balanced' and 'representative' picture of a specific area of the language. 'If you are involved in language teaching rather than lexicography, single word lists from small selective corpora can be seriously useful' (Tribble 1997). 'In order to study the behaviour of words in texts, we need to have available quite a large number of occurrences' (Sinclair 1991: 18).
Size: Corpus size is usually represented as: the overall number of words in the corpus (e.g. the BNC is a 100-million-word corpus of present-day British English); the overall number of texts in the corpus (e.g. the BNC features more than 4,000 texts; 90% of the corpus is written and 10% of its size goes to the spoken language sample) (Hoffman et al. 2008).
Size: Issues of size also include the number of text types in each text category, the number of samples within each text type and the number of words per individual sample. E.g. if there are too few texts in a category, then one single text can influence the results (cf. the occurrence of 'scum' in the Leeds United email list in the BNC). The number of samples from each text is also important. E.g. if you are researching the characteristics of academic research articles, then you would need samples from various parts of the article (Introduction, Methods, Results and Discussion), as they all feature different language patterns.
Is there an ideal sample size? Oostdijk (1991) and Kennedy (1998): 'A sample size of 20,000 words would yield samples that are large enough to be representative of a given variety'. This is based on heuristics (rules of thumb) (McEnery and Hardie 2012). In the BNC, target sample sizes of 40,000 words have been used.
Size: Some research shows (Biber 1990) that ten texts per category (e.g. LOB Corpus) are representative enough. Still, many corpora feature more texts per category. The number of words per sample should provide a stable and reliable count of (grammatical or other) features in a text. Usually a 1,000-word sample provides a stable count of the majority of common features (Biber 1990). However, in lexicographic research in particular, some lexemes are so rare that much larger samples are needed.
Size: Biber has pointed out that there is considerable variation within genre: for some genres 20,000 words would provide an adequate sample size, for others it would not. E.g. in creating the Business English Corpus (BEC), 114 faxes were collected from different sources to reach approximately 20,000 words. However, in the category of 'business books', 20,000 words would not cover even one book. For this reason, a larger sample size of 50,000 words was used for books, taking five 10,000-word samples from five different books.
Use of text chunks or extracts: The BNC, with 40,000-word extracts, did not use full texts (partially for copyright reasons). It used continuous text from within a whole, cutting the sample at a logical point such as the end of a chapter. This approach is well suited to the study of general language. In the BEC written section, which was concerned with the specialist language of business, whole texts were used wherever possible.
3 main criteria for size in written corpora (Nelson, 2010)
"The overall size of a corpus can be secondary to the need for adequate sampling" (Nelson, 2010)
Ethics: When contacting potential sources of texts, it is essential to ensure both that the data you collect is treated according to the laws of copyright and that you observe the privacy of the authors, if the texts come from the private domain. Draw up a contract on the usage of the data that you receive from respondents. Once all the data have been gathered, the next step is to store them and make them easily available for retrieval (Nelson 2010).
Biber's criteria for text sampling: channel (written, spoken, electronic); published or unpublished; institutional or non-institutional; demographics of writer/speaker; factual or fictional; purpose (persuasive, informative, etc.); topic.
To sum up: The appropriate design for a corpus depends upon what part, how much and what phenomenon/a in language it is meant to represent. The representativeness of the corpus determines the kind of research questions that can be addressed and the generalizability of the results of the research. We do not know the full extent of language variability and/or variety; therefore, no corpus can be ideally representative. However, a certain degree of representativeness must be provided.
TASK FOR TODAY (GROUP DISCUSSION): Imagine that you are supposed to compile a representative corpus of Present-day English advertisements. What texts/pieces of language would you include in your corpus? How much of each text would be 'just right' for its representation? Are there any texts that you would deliberately omit from the corpus? Why? Think of the explicit criteria you would use.