This presentation is about corpus linguistics.

NezrinMemmedzade1 30 views 13 slides May 11, 2024


Slide Content

Introduction to Corpus Linguistics By Karimli Vuqar

What is a Corpus?
  Definition
  Why are corpora used?
  What are they considered to be? Method vs. theory
  Types of corpora
    Monolingual vs. multilingual
    Parallel vs. translated

Corpus Linguistics
  CL history
    1960: 1st generation, e.g. Brown
    1975: 2nd generation, e.g. Cobuild
    1990: 3rd generation, e.g. BoE (Bank of English)
  Roots
  CL and linguistics
    Comparative linguistics
    Syntactics and semantics
    Chomskyan revolution
  Technology and the progress of CL
  Benefits of CL
  Problems of CL

Building the Corpora
  General corpora, e.g. the BNC, the Brown Corpus
  Specialized corpora
  How corpora are used (written and spoken)
  Materials for creating corpora (newspapers, books, documents, etc.)
  General (social, science, art, etc.)
  Multilingual corpora and parallel corpora
  Learner corpora (International Corpus of Learner English)
  Monitor corpus (The Bank of English)
  Historical corpus

Advantages and Disadvantages
  Advantages
    More reliable than intuition
    Language patterns are easily identified
    Texts can be deconstructed to discover patterns
    Track the development of specific features in the history of English
    Test hypotheses about specific language features empirically
    Follow language acquisition properly
    Draw conclusions from large amounts of linguistic data
  Disadvantages
    Not always a complete picture
    Shows frequency rather than possibility

CL terminology
  Concordance: where and in what context?
  Frequency
  Annotation
  Mark-up
  Tagging
    POS tagging
    Syntactic (treebank)
    Semantic tagging
  Coding
  Metadata
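
Two of the terms above, frequency and concordance, can be illustrated with a short sketch. It is not taken from the slides: the function names (`tokenize`, `frequency_list`, `concordance`), the sample text, and the very simple regex tokenizer are all assumptions made for illustration. Real concordancers handle tokenization, case, and annotation far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes as tokens.

    This is a deliberately naive tokenizer (an assumption for the sketch).
    """
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, top_n=5):
    """Return the top_n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(top_n)


def concordance(tokens, node, window=3):
    """Return key-word-in-context (KWIC) lines for every hit of `node`.

    Each line is a (left context, node word, right context) triple, with
    up to `window` tokens on either side.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical three-sentence "corpus" used only for this example.
sample = (
    "The corpus is a collection of texts. A corpus can be written or spoken. "
    "Linguists search the corpus for patterns."
)
tokens = tokenize(sample)

print(frequency_list(tokens, top_n=3))
for left, node, right in concordance(tokens, "corpus", window=2):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>12} [{node}] {right}")
```

The frequency list answers "how often?", while the concordance lines answer "where and in what context?" by showing each hit with its neighbouring words.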

Famous Corpora Credits: Nadja Nesselhauf

Corpora and Translation
  Corpus translation studies (CTS)
  Descriptive translation
  Equivalence
  Corpus-based translation
  The process vs. the product
  The third code
  Simplification vs. normalization

Methods of Research in CL
  Quantitative
  Qualitative
  Context
  Quantitative and qualitative

Corpus Software
  AntConc
  MICASE: Michigan Corpus of Academic Spoken English
  TACT: Text Analysis Computing Tools
  TACTWeb: a concordance program based on TACT but for the Web
  SARA: the concordance program written specifically for the British National Corpus

Corpus Software Continued
  BNCweb: a web-based client program for searching and retrieving lexical, grammatical and textual data from the British National Corpus (BNC). It relies on the Corpus Query Processor (CQP) of the IMS Open Corpus Workbench to provide a convenient interface between the user and the rich variety of annotated text in the 100-million-word BNC in its most recent incarnation, the XML version.
  BNC Web Index: the web front end to David Lee's BNC Index spreadsheet. For an introduction to BNC Index, see David's web site.
  CLAWS: part-of-speech tagging software for English.
  Clustertool: allows you to perform Hierarchical Agglomerative Cluster Analysis on your own data.
  CQPweb: an extension of BNCweb, designed for use with any corpus.
  LL Calculator: calculates log-likelihood values from a 2x2 contingency table. LL is a more reliable alternative to the standard Pearson's chi-squared test; see Dunning (1993).
  LWAC: a tool for constructing corpora from web data.
  Sentrick: a stream-oriented Java library and a set of command-line tools for high-quality sentence boundary detection (sentence segmentation, splitting, disambiguation). Currently has one model for German (trained on general text and Wikipedia lynx dumps).
  SigTest: Flexible Significance Test System offering the chi-squared test, log-likelihood test and Fisher exact test for any kind of contingency table, using R.
  USAS: a semantic tagger developed for English and extended to Finnish and Russian.
  VARD: Variant Detector software that facilitates the pre-processing of corpora for normalisation of spelling variation (e.g. Early Modern English).
  Wmatrix: a corpus comparison and annotation tool incorporating CLAWS and USAS in a web front end.
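
The log-likelihood calculation mentioned for the LL Calculator can be sketched as follows. This is a minimal illustration of the general G2 statistic, G2 = 2 * sum(O * ln(O / E)) over the four cells, and not necessarily the exact formula the LL Calculator itself uses. The function name `log_likelihood` and the example counts are hypothetical.

```python
import math


def log_likelihood(table):
    """Compute the log-likelihood (G2) statistic for a 2x2 contingency table.

    `table` is ((a, b), (c, d)) of observed counts. For corpus comparison
    one common layout (assumed here) is:
        row 0: occurrences of the word in corpus A and in corpus B
        row 1: all other tokens in corpus A and in corpus B
    Expected counts are derived from the row and column totals.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    if n == 0:
        raise ValueError("contingency table is empty")
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)

    total = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing (0 * ln 0 := 0)
                expected = row_totals[i] * col_totals[j] / n
                total += observed * math.log(observed / expected)
    return 2.0 * total


# Hypothetical counts: a word occurs 120 times in a 10,000-token corpus A
# and 60 times in a 12,000-token corpus B.
example = ((120, 60), (10_000 - 120, 12_000 - 60))
print(round(log_likelihood(example), 2))
```

With one degree of freedom, a value above 3.84 corresponds to p < 0.05 under the usual chi-squared approximation, so larger values indicate a stronger difference in relative frequency between the two corpora.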

Additional Resources
  University of Lancaster Centre for Computer Corpus Research on Language (Summer School): http://ucrel.lancs.ac.uk/
  ESRC Centre for Corpus Approaches to Social Science (CASS), University of Lancaster
  McEnery, Tony, and Wilson, Andrew. Corpus Linguistics, 2nd ed. Edinburgh University Press, 2001.
  Aston, Guy, and Burnard, Lou. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.
  Biber, Douglas, Conrad, Susan, and Reppen, Randi. Corpus Linguistics: Investigating Language Structure and Use. CUP, 1998.

Questions/Comments