LOB Corpora: Important aspects a translator needs to know
MeibisN
15 slides
Sep 03, 2024
About This Presentation
Why the LOB Corpus is important
The Lancaster-Oslo/Bergen (LOB) Corpus is a one-million-word collection of British English texts which was compiled in the 1970s in collaboration between the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen, to provide a British counterpart to the Brown Corpus compiled by Henry Kučera and W. Nelson Francis for American English in the 1960s.
Its composition was designed to match the original Brown Corpus as closely as possible in size and genres, using documents published in the UK in 1961 by British authors.[1] Both corpora consist of 500 samples, each comprising about 2,000 words, in the following genres:
Slide Content
Lancaster-Oslo-Bergen Corpus
Business and Legal Translation. Comparable bilingual corpora: The LOB Corpus
What is a corpus? It is a collection of electronically stored semiotic data that has been designed according to specific corpus design criteria to be maximally representative of (a particular variety of) language or other semiotic systems (Butler, 2004).
From the definition… Electronically stored data: the corpus can be processed by software. Semiotic: it is about meaning making, so it can include gestures as well as language. Designed: the researchers carefully decide what to include and exclude, and in what proportion. Representative: it is a valid sample of a language variety or other semiotic system, made up of naturally occurring examples of language (spoken or written). What we find out about the corpus therefore lets us draw conclusions about the language or semiotic system.
What is a corpus? It is a principled and large collection (body) of authentic texts that are stored in a computer and analyzed using software designed for corpus analysis. “Principled” means that data collection is not done randomly but follows a planned operation. “Authentic” means genuine communication by people going about their normal business (Sinclair, 1996).
What is a corpus? Computer-readable semiotic data (this makes the analysis easier, faster and more accurate). Authentic material (people produced it on particular social occasions, or it has otherwise been deemed authentic). Designed to be representative.
Comparable corpus: A comparable corpus is one of a set of two or more monolingual corpora, typically each in a different language, built according to the same principles. The content is therefore similar, and results can be compared between the corpora even though they are not translations of each other (and are therefore not aligned).
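Because comparable corpora are rarely exactly the same size, comparisons are usually made on relative rather than raw frequencies. The following is a minimal Python sketch of that idea, not a method taken from the slides: the function name `relative_frequencies`, the simple regex tokenizer, and the two tiny sample texts are all invented for illustration.

```python
import re
from collections import Counter


def relative_frequencies(text, per=1000):
    """Return each word's frequency per `per` tokens (hypothetical helper)."""
    # Very rough tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * per / total for word, count in counts.items()}


# Two made-up samples of different sizes standing in for comparable corpora.
sample_a = "the contract binds the parties and the contract is final"  # 10 tokens
sample_b = "the agreement binds both parties"                          # 5 tokens

freq_a = relative_frequencies(sample_a)
freq_b = relative_frequencies(sample_b)

# Print the normalized frequency of a few words side by side.
for word in ("the", "parties", "contract"):
    print(word, freq_a.get(word, 0.0), freq_b.get(word, 0.0))
```

Normalizing per 1,000 tokens is an arbitrary choice here; with corpora the size of LOB or Brown, a per-million figure would be the more natural unit.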
Comparable bilingual corpus: Normally specialized collections of similar source texts in the two languages. Such corpora can be 'mined' for terminology and other equivalences.
The LOB Corpus exists in two main versions: the original version and a POS-tagged version. In the tagged corpus each word is accompanied by a word-class tag, assigned through a combination of automatic tagging programs and manual pre- and post-editing.
Tagged versions: Each word is accompanied by a word-class tag. There is no syntactic bracketing. The tagged corpus comes in two formats. I: a horizontal format, with a running text where each word is immediately followed by its associated tag. II: a vertical format, where each word is on a separate line together with its associated tag, some 'special information' and a reference number.
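The difference between the two formats can be illustrated with a small Python sketch that turns one horizontal line into a simplified vertical listing. Several things here are assumptions rather than details from the slides: the underscore separator between word and tag, the invented sample sentence, and the helper names `parse_horizontal` and `to_vertical`. The 'special information' and reference-number columns of the real vertical format are left out.

```python
def parse_horizontal(line, sep="_"):
    """Split a horizontal-format line into (word, tag) pairs.

    Assumes each token looks like word<sep>TAG; the separator is an assumption.
    """
    pairs = []
    for token in line.split():
        if sep in token:
            # Split on the LAST separator so words containing it stay intact.
            word, _, tag = token.rpartition(sep)
        else:
            word, tag = token, ""  # untagged token: keep the word, empty tag
        pairs.append((word, tag))
    return pairs


def to_vertical(pairs):
    """Render one word per line, followed by a tab and its tag."""
    return "\n".join(f"{word}\t{tag}" for word, tag in pairs)


# Invented example line, not quoted from the corpus.
sample = "the_ATI translator_NN reads_VBZ ._."

pairs = parse_horizontal(sample)
print(to_vertical(pairs))
```

For a translator, the vertical layout is the easier one to load into a spreadsheet or concordancer, while the horizontal layout is easier to read as running text.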