Introduction to Corpus Linguistics By Karimli Vuqar
What is a Corpus?
  Definition
  Why are corpora used?
  What are they considered to be? Method vs. theory
  Types of corpora
    Monolingual vs. multilingual
    Parallel vs. translated
Corpus Linguistics (CL)
  CL history
    1960: 1st generation, e.g. Brown
    1975: 2nd generation, e.g. COBUILD
    1990: 3rd generation, e.g. BoE (Bank of English)
  Roots
  CL and linguistics
  Comparative linguistics
  Syntax and semantics
  Chomskyan revolution
  Technology and the progress of CL
  Benefits of CL
  Problems of CL
Building the Corpora
  General corpora, e.g. the BNC, the Brown Corpus
  Specialized corpora
  How corpora are used (written, spoken)
  Materials for creating corpora (newspapers, books, documents, etc.)
  General (social, science, art, etc.)
  Multilingual corpora and parallel corpora
  Learner corpora (International Corpus of Learner English)
  Monitor corpus (the Bank of English)
  Historical corpus
Advantages and Disadvantages
  Advantages
    More reliable than intuition
    Language patterns are easily identified
    Texts can be deconstructed to discover patterns
    Track the development of specific features in the history of English
    Test hypotheses about specific language features empirically
    Follow language acquisition properly
    Draw conclusions from large amounts of linguistic data
  Disadvantages
    Not always a complete picture
    Shows frequency rather than possibility
CL Terminology
  Concordance: where and in what context?
  Frequency
  Annotation
  Mark-up
  Tagging
    POS tagging
    Syntactic (treebank)
    Semantic tagging
  Coding
  Metadata
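To make the terms "frequency" and "concordance" concrete, here is a minimal Python sketch using only the standard library. The function names (tokenize, frequency_list, concordance), the tokenization rule, and the sample sentence are illustrative assumptions, not the behaviour of any particular corpus tool.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, n=5):
    """Return the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, node, window=3):
    """Return key-word-in-context (KWIC) lines for every hit of `node`.

    Each line is a (left context, node word, right context) tuple, with
    up to `window` tokens on either side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy "corpus", for illustration only.
sample = "The cat sat on the mat. The dog sat on the log."
tokens = tokenize(sample)

print(frequency_list(tokens, n=3))
for left, node, right in concordance(tokens, "sat", window=2):
    print(f"{left:>12} [{node}] {right}")
```

A concordance answers the "where and in what context?" question by lining up every occurrence of the search word with its neighbours, while the frequency list only counts occurrences. Real tools add sorting, annotation-aware search, and much larger context handling.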
Famous Corpora
  (Credits: Nadja Nesselhauf)
Corpora and Translation
  Corpus translation studies (CTS)
  Descriptive translation
  Equivalence
  Corpus-based translation
  The process vs. the product
  The third code
  Simplification vs. normalization
Methods of Research in CL
  Quantitative
  Qualitative
  Context
  Quantitative and qualitative
Corpus Software
  AntConc
  MICASE: Michigan Corpus of Academic Spoken English
  TACT: Text Analysis Computing Tools
  TACTWeb: a concordance program based on TACT, but for the Web
  SARA: the concordance program written specifically for the British National Corpus
Corpus Software (continued)
  BNCweb: a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the rich variety of annotated text in the 100-million-word BNC in its most recent incarnation, the XML version.
  BNC Web Index: the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, see David Lee's web site.
  CLAWS: part-of-speech tagging software for English.
  Clustertool: lets you perform hierarchical agglomerative cluster analysis on your own data.
  CQPweb: an extension of BNCweb, designed for use with any corpus.
  LL Calculator: calculates log-likelihood values from a 2x2 contingency table. LL is a more reliable alternative to the standard Pearson's chi-squared test; see Dunning (1993).
  LWAC: a tool for constructing corpora from web data.
  Sentrick: a stream-oriented Java library and a set of command-line tools for high-quality sentence boundary detection (sentence segmentation, splitting, disambiguation). Currently has one model for German (trained on general text and Wikipedia lynx dumps).
  SigTest: Flexible Significance Test System. Chi-squared test, log-likelihood test and Fisher exact test for any kind of contingency table, using R.
  USAS: a semantic tagger developed for English and extended to Finnish and Russian.
  VARD: Variant Detector software that facilitates the pre-processing of corpora by normalising spelling variation (e.g. in Early Modern English).
  Wmatrix: a corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.
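The log-likelihood value that the LL Calculator entry refers to can be illustrated with a short Python sketch. It computes the general G2 statistic over all four cells of a 2x2 table (a word's frequency versus all other words, in two corpora). The function name, parameter names, and example counts are assumptions made for illustration; the sketch is not claimed to reproduce the LL Calculator's exact implementation.

```python
import math


def log_likelihood_2x2(freq1, size1, freq2, size2):
    """G2 (log-likelihood) for a word occurring freq1 times in a corpus of
    size1 tokens and freq2 times in a corpus of size2 tokens.

    Contingency table (observed counts):
                    corpus 1         corpus 2
      word          freq1            freq2
      other words   size1 - freq1    size2 - freq2
    """
    observed = [
        [freq1, freq2],
        [size1 - freq1, size2 - freq2],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row_totals)

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o == 0:
                continue  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
            expected = row_totals[i] * col_totals[j] / total
            g2 += o * math.log(o / expected)
    return 2 * g2


# Hypothetical counts: same relative frequency in both corpora, then a skewed case.
print(log_likelihood_2x2(10, 100, 20, 200))
print(log_likelihood_2x2(10, 30, 20, 30))
```

With one degree of freedom, values above roughly 3.84 are conventionally treated as significant at p < 0.05. When the word has the same relative frequency in both corpora, the statistic is zero.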
Additional Resources
  University of Lancaster, Centre for Computer Corpus Research on Language (Summer School): http://ucrel.lancs.ac.uk/
  ESRC Centre for Corpus Approaches to Social Science (CASS), University of Lancaster
  Aston, Guy, and Burnard, Lou. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.
  Biber, Douglas, Conrad, Susan, and Reppen, Randi. Corpus Linguistics: Investigating Language Structure and Use. CUP, 1998.
  McEnery, Tony, and Wilson, Andrew. Corpus Linguistics, 2nd ed. Edinburgh University Press, 2001.