DIGITAL LANGUAGE RESOURCES
AND THEIR USE IN EDUCATION
Assist. Prof. Rositsa Dekova, PhD
The Paisii Hilendarsky University of Plovdiv
Department of English Studies
Learning English can be fun
Educators & Computers
https://learningapps.org/
http://www.abcya.com/
https://www.education.com/worksheet-generator/
http://www.teachers-direct.co.uk/resources/
http://puzzlemaker.discoveryeducation.com/
http://worksheets.theteacherscorner.net/
http://tools.atozteacherstuff.com/
https://www.puzzle-maker.com/
http://www.armoredpenguin.com/
Word search example
Word search generated by
http://puzzlemaker.discoveryeducation.com
African Animals
Word search example
Word search from the database of
http://www.teachers-direct.co.uk/
January
…
Monday
Word search example
black
blue
brown
green
grey
orange
purple
red
white
yellow
ANNOTATED CORPORA
Corpus: a large body of machine-readable,
naturally occurring linguistic evidence.
Annotated corpus: a corpus enhanced with various types
of linguistic information
Morphological: POS tagging
Semantic: tagging with word senses
Syntactic: tagging for syntactic information
…
TEXT SEGMENTATION
Electronic text is just a sequence of characters.
Before any processing is done, the text has to be
segmented into linguistic units, such as words,
punctuation, numbers, alphanumericals (H2O), etc.
This process is called TOKENIZATION and the
segmented units are called TOKENS.
The process of segmenting the text into sentences
is called SENTENCE SPLITTING.
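Tokenization and sentence splitting can be tried out in a few lines. The sketch below is a naive illustration using only regular expressions; the function names and the sample text are invented for this example, and the sentence rule breaks on abbreviations such as "Dr.".

```python
import re

# One token per match: alphanumericals such as "H2O" stay whole,
# every other non-space character (punctuation) becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

# Naive rule: a sentence ends at ".", "!" or "?" followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split raw text into sentences."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Segment one sentence into tokens."""
    return TOKEN_PATTERN.findall(sentence)


text = "Plants need H2O and light. We planted 3 trees!"
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Real tokenizers add many language-specific rules (clitics, abbreviations, dates); this only shows the principle.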
‘HIGH-LEVEL’ TEXT SEGMENTATION
Intra-sentential segmentation:
Named Entities
Syntactic chunking (segmentation of noun
groups and verb groups)
Inter-sentential segmentation:
Grouping of sentences and paragraphs into
discourse topics called TEXT TILES.
LEMMATIZATION
Reduces inflectional forms and sometimes
derivationally related forms of a word to
the base or dictionary form of the word, which is
known as the lemma.
For instance:
am, are, is → be
car, cars, car's, cars' → car
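A minimal sketch of lemmatization as a dictionary lookup, using only the forms from the examples above. The table and the function name are assumptions made for illustration; real lemmatizers rely on large lexicons and morphological rules.

```python
# Toy lookup table built from the examples above (not a real lexicon).
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cars": "car", "car's": "car", "cars'": "car",
}


def lemmatize(token):
    """Return the dictionary form of a token, or the lowercased token if unknown."""
    lowered = token.lower()
    return LEMMA_TABLE.get(lowered, lowered)


print([lemmatize(t) for t in ["Cars", "are", "car's", "is", "car"]])
# → ['car', 'be', 'car', 'be', 'car']
```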
PART-OF-SPEECH TAGGING
Parts of speech
The morphological and syntactic classes that
words can be assigned to.
POS tagging
Automatic assignment of descriptors called tags
to input tokens.
THE TAGSET
The tagset includes all the tags that will be
used in the POS tagging.
We could use a very coarse tagset:
N, V, Adj, Adv, Prep...
A more commonly used set is finer-grained:
NN, NNS, NNP, NNPS, VB, VBG, VBN, VBP, VBZ…
The level of granularity used in the tagset
directly affects the search possibilities.
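To see how tagset granularity affects what can be searched for, here is a small sketch. The two sentences are hand-tagged for this illustration (not produced by a tagger), and the mapping and function names are assumptions.

```python
# Fine-grained tags listed above, collapsed to coarse classes.
FINE_TO_COARSE = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "VB": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
}

# Hand-tagged toy data as (word, fine tag) pairs.
TAGGED = [
    ("Plants", "NNS"), ("need", "VBP"), ("water", "NN"), (".", "."),
    ("Every", "DT"), ("plant", "NN"), ("needs", "VBZ"), ("light", "NN"), (".", "."),
]


def search_fine(tagged, tag):
    """Words carrying exactly this fine-grained tag."""
    return [word for word, fine in tagged if fine == tag]


def search_coarse(tagged, tag):
    """Words whose fine-grained tag maps to this coarse class."""
    return [word for word, fine in tagged if FINE_TO_COARSE.get(fine) == tag]


print(search_coarse(TAGGED, "N"))   # all nouns
print(search_fine(TAGGED, "NNS"))   # plural common nouns only
```

With the coarse tagset alone, a query such as "plural nouns only" cannot be expressed; the finer tagset makes it a simple lookup.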
CASES OF AMBIGUITY
They love_V summer_Adj vacations.
Their love_N started in the summer_N.
Every plant_N needs water and light.
We should all plant_V at least one tree
in our life.
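A tagger resolves such cases from context. The sketch below is a deliberately simple rule, assumed for illustration only: it looks at up to two preceding words and lets the nearest cue word decide between N and V. The cue lists are hypothetical and far from complete; real taggers use statistical or neural models.

```python
# Hypothetical cue words: determiners/possessives suggest a noun follows,
# subject pronouns and modals suggest a verb follows.
NOUN_CUES = {"the", "a", "an", "every", "their", "my", "our"}
VERB_CUES = {"i", "we", "you", "they", "should", "can", "will", "must"}


def tag_ambiguous(tokens, index):
    """Guess 'N' or 'V' for tokens[index] from up to two preceding words."""
    window = [t.lower() for t in tokens[max(0, index - 2):index]]
    for word in reversed(window):  # the nearest cue wins
        if word in NOUN_CUES:
            return "N"
        if word in VERB_CUES:
            return "V"
    return "N"  # arbitrary fallback when no cue is found


print(tag_ambiguous("They love summer vacations".split(), 1))
print(tag_ambiguous("Their love started in the summer".split(), 1))
print(tag_ambiguous("Every plant needs water and light".split(), 1))
print(tag_ambiguous("We should all plant at least one tree".split(), 3))
```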
Examples of Taggers and Parsers
CLAWS WWW tagger (Free web tagging service for English)
http://ucrel.lancs.ac.uk/claws/trial.html
The Stanford Parser online
http://nlp.stanford.edu:8080/parser/
Shallow Parsing Demo
Syntactic Tree Generator URL
An app that builds syntactic trees from labelled
bracket notations.
https://demo.allennlp.org/
Constituency Parsing:
Breaks a text into sub-phrases
(constituents).
https://demo.allennlp.org/
Reading comprehension: answers questions
about a passage of text.
Sentiment analysis: predicts whether an input is
positive or negative.
Coreference resolution: finds all expressions that
refer to the same entity in a text.
Language modeling: generates the most likely
next words.
…
Sentiment Analysis Examples
She's certainly creating a stir with her
groundbreaking mix of rap and folk.
RoBERTa Large: The model is very confident
that the sentence has a positive sentiment.
It was a complete flop because I couldn’t hear her
properly.
RoBERTa Large: The model is very confident
that the sentence has a negative sentiment.
Speak Out Upper-Intermediate, p. 120, ex. 4A
Sentiment Analysis Examples
I just hope she doesn’t go mainstream and boring like
all the other alternative stars.
RoBERTa Large: The model is somewhat confident
that the sentence has a positive sentiment.
GloVe-LSTM: The model is very confident
that the sentence has a negative sentiment.
Speak Out Upper-Intermediate, p. 120, ex. 4A
Specific Applications
For visually impaired and reading-impaired students:
https://www.naturalreaders.com/online/
Online Text-to-Speech
Free Chrome extension
Dyslexia Font
http://www.robobraille.org/robobraille-projects
Convert a file into an alternative, accessible format
Teaching Guides
BRITISH NATIONAL CORPUS (BNC)
A 100 million word collection of samples of
written and spoken language from a wide
range of sources, designed to represent a
wide cross-section of current British English,
both spoken and written.
Available online at:
https://www.english-corpora.org/bnc/
BRITISH NATIONAL CORPUS (BNC)
The written part of the BNC (90%) includes
extracts from regional and national newspapers,
specialist periodicals and journals for all ages and interests,
academic books and popular fiction, published and unpublished
letters and memoranda,
school and university essays, etc.
The spoken part (10%) consists of
orthographic transcriptions of unscripted informal conversations
(recorded by volunteers selected from different age groups, regions and social
classes in a demographically balanced way)
spoken language collected in different contexts, ranging from
formal business or government meetings to radio shows and
phone-ins
THE CORPUS OF CONTEMPORARY
AMERICAN ENGLISH (COCA)
The Corpus of Contemporary American
English (COCA) is the largest freely-available
corpus of English, and the only large and
balanced corpus of American English.
The corpus was created by Mark Davies of
Brigham Young University.
Available online at:
https://www.english-corpora.org/coca/
COCA
Contains one billion words of text in eight
genres: spoken, fiction, popular magazines,
newspapers, academic texts, TV and movie
subtitles, blogs, and other web pages.
Updated regularly: 25+ million words
included for each year from 1990 to 2019.
Suitable for looking at current, ongoing
changes in the language.
THE COCA SEARCH ENGINE
Searches for exact words or phrases, wildcards,
lemmas, part of speech, or any combinations of
these.
Searches for surrounding words (collocates)
within a ten-word window.
Limit searches by frequency and compare the
frequency of words, phrases, and grammatical
constructions:
by genre or even between sub-genres (or domains)
over time
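The idea of a collocate search can be sketched on a toy corpus. The function below counts the words that occur within a given number of tokens to the left and right of a node word; the corpus sentence, the function name and the `window` parameter are assumptions for illustration and say nothing about how the COCA engine itself is implemented.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words within `window` tokens on either side of each `node` hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts


corpus = "she wore a black dress and he drank black coffee with a black cat".split()
print(collocates(corpus, "black", window=1).most_common())
```

On a real corpus the raw counts would normally be combined with an association measure, so that frequent function words such as "a" do not dominate the list.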
Results for collocates of black
SEMANTICALLY-BASED QUERIES OF THE CORPUS
Contrast and compare the collocates of two related
words (little/small, democrats/republicans,
men/women).
Determine the difference in meaning or use between
these words.
Find the frequency and distribution of synonyms for
nearly 60,000 words
Compare the frequency in different genres.
Create your own lists of semantically-related words, and
then use them directly as part of the query
SPECIFIC USES OF COCA
To look at recent changes in English:
morphology (new suffixes -friendly and -gate)
syntax (including prescriptive rules, quotative like, so
not ADJ, the get passive, resultatives, and verb
complementation)
semantics (such as changes in meaning with web,
green, or gay)
lexis, including word and phrase frequency by year,
to produce lists of all words that have had large shifts
in frequency between specific historical periods.
PARALLEL CORPORA
Alignment (of bitexts)
Differences in grammatical structure
with the sun not shining - нямаше слънце
Differences in lexical structure
the thermometer walks inch by inch up to the top of the
glass - и термометърът пълзи сантиметър по
сантиметър до върха на скалата
No lexicalization
It wasn't the butler coming back. - Не беше икономът.
It’s this way - Положението е такова
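Once sentences are aligned, a search in one language gives access to the corresponding sentences in the other. A minimal sketch, assuming the bitext is simply a list of (English, Bulgarian) sentence pairs; the pairs are taken from the examples above and the function name is invented.

```python
# Sentence-aligned toy bitext: (English, Bulgarian) pairs.
BITEXT = [
    ("It wasn't the butler coming back.", "Не беше икономът."),
    ("It's this way", "Положението е такова"),
]


def find_aligned(query, bitext):
    """Return every aligned pair whose English side contains the query."""
    needle = query.lower()
    return [(en, bg) for en, bg in bitext if needle in en.lower()]


for english, bulgarian in find_aligned("butler", BITEXT):
    print(english, "|", bulgarian)
```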
INTELLIGENT SEARCHES IN BG-EN PARALLEL CORPUS
The Bulgarian National Corpus search engine is available at:
http://search.dcl.bas.bg/
The syntax allows search by (combinations of) word forms,
grammatical tags, semantic relations.
Thanks to the alignment, the corresponding sentences in
parallel documents are also accessible.
The hits are paginated and the matches are highlighted.
The user is able to view the detailed information for a given
sentence in the hit set: the sentence metadata, its context,
and correspondence(s) in the other languages.
SEARCH ASSISTANT
Lexical Semantic Networks
Electronic language resources which define notions
through their relations with other notions.
LSN are knowledge representation schemes involving nodes
and links (arcs or arrows) between nodes.
The nodes represent objects or concepts.
The links represent relations between nodes.
The links are directed and labeled.
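Such a network of nodes and directed, labeled links can be sketched as a small data structure. The class, the relation labels ("is_a", "has_part") and the sample links are assumptions made for illustration.

```python
class SemanticNetwork:
    """Directed graph whose links carry a relation label."""

    def __init__(self):
        self._links = {}  # source node -> list of (label, target node)

    def add_link(self, source, label, target):
        """Add a directed link source --label--> target."""
        self._links.setdefault(source, []).append((label, target))

    def targets(self, source, label):
        """Nodes reachable from `source` over one link with this label."""
        return [t for lab, t in self._links.get(source, []) if lab == label]


net = SemanticNetwork()
net.add_link("rose", "is_a", "plant")
net.add_link("plant", "is_a", "living organism")
net.add_link("car", "has_part", "wheel")

print(net.targets("rose", "is_a"))    # → ['plant']
print(net.targets("plant", "is_a"))   # → ['living organism']
```

Because the links are directed, asking for the "is_a" targets of "plant" does not return "rose".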
A classical taxonomy tree
adult = grown-up human
man = male adult
woman = female adult
child = young human
boy = male child
girl = female child
human
    adult [+adult]
        man [+male]
        woman [-male]
    child [-adult]
        boy [+male]
        girl [-male]
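The tree can be read as feature inheritance: each word carries its own feature and inherits those of its ancestors. A small sketch under that reading; the dictionary layout and the function name are assumptions.

```python
# Parent of each node in the taxonomy ("human" is the root).
PARENT = {
    "adult": "human", "child": "human",
    "man": "adult", "woman": "adult",
    "boy": "child", "girl": "child",
}

# Feature introduced at each node: (feature name, value).
FEATURE = {
    "adult": ("adult", True), "child": ("adult", False),
    "man": ("male", True), "woman": ("male", False),
    "boy": ("male", True), "girl": ("male", False),
}


def features(word):
    """Collect the features on the path from `word` up to the root."""
    collected = {}
    while word in PARENT:
        name, value = FEATURE[word]
        collected[name] = value
        word = PARENT[word]
    return collected


print(features("girl"))   # girl = [-male], [-adult]
print(features("man"))    # man = [+male], [+adult]
```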
WORDNET: http://wordnet.princeton.edu/
A large lexical semantic database of English
Originally developed at Princeton University (Miller, 1990)
EuroWordNet: http://www.illc.uva.nl/EuroWordNet/
BalkaNet: http://www.dblab.upatras.gr/balkanet/index.htm
Each wordnet represents a unique language-internal
system of lexicalizations
In addition, the wordnets are linked to an Inter-Lingual-Index,
based on the Princeton wordnet
WORDNET STRUCTURE
Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each
expressing a distinct concept.
Each synset is linked to other synsets by means of a
small number of “conceptual relations.”
WordNet really consists of four sub-nets, one each
for nouns, verbs, adjectives and adverbs, with few
cross-POS pointers.
WORDNET STRUCTURE
http://wordnet.princeton.edu/man/wngloss.7WN.html
Each synonym set (SYNSET) encodes the relation of
equivalence between a number of lexical items
(LITERALS), where each lexeme:
has a unique meaning (specified by the value of SENSE)
pertains to one and the same part of speech
(specified as the value of POS)
represents one and the same lexical meaning
(specified as the value of DEF, the definition)
An example: learn (WordNet)
BulNet: http://dcl.bas.bg/bulnet/
A lexical semantic network of Bulgarian
comprises around 49,189 synonym sets
distributed into nine parts of speech
open-class words: nouns, verbs, adjectives and
adverbs
closed-class words: pronouns, prepositions,
conjunctions, particles and interjections
STRUCTURE
Each synset is linked to its counterpart in PWN 3.0 by
means of a unique identification number (ID).
The common synsets in the Balkan languages are
marked as common concepts subsets (BCS).
In the monolingual database a synset should be linked to
at least one other synset through an intralingual
relation.
Non-obligatory information may also be encoded, such
as examples of usage, stylistic, morphological or
syntactic properties.
RELATIONS IN BULNET
Synonymous sets are linked through various relations:
SEMANTIC
Synonymy, antonymy, hypernymy, hyponymy, meronymy,
holonymy, entailment, inclusion, causation, etc.
MORPHOSEMANTIC
BEINSTATE
MORPHOLOGICAL
DERIVED
PARTICLE
EXTRALINGUISTIC
SEMANTIC RELATIONS
SYNONYMY: a semantic relation of equivalence
between literals belonging to the same POS.
The synonyms form the synonym set, also called
SYNSET.
For example:
The lexical units
{auto:1, car:2, automobile:2, machine:3, motorcar:1}
form a synset as they refer to the same concept.
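A synset with its literals, sense numbers, POS and definition can be modelled as a small record. This is a sketch only: the class and field names are assumptions, not the BulNet or Princeton WordNet file format, and the definition string is a placeholder gloss written for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Literal:
    word: str
    sense: int  # SENSE: which meaning of the word is intended


@dataclass
class Synset:
    pos: str          # POS shared by all literals
    definition: str   # DEF shared by all literals
    literals: list = field(default_factory=list)

    def words(self):
        """The word forms of all literals in the synset."""
        return [lit.word for lit in self.literals]


car_synset = Synset(
    pos="n",
    definition="a motor vehicle with four wheels",  # placeholder gloss
    literals=[
        Literal("auto", 1), Literal("car", 2), Literal("automobile", 2),
        Literal("machine", 3), Literal("motorcar", 1),
    ],
)

print(car_synset.words())
print("machine" in car_synset.words())
```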
SEMANTIC RELATIONS
ANTONYMY: a semantic relation of opposition,
established between two members belonging to
one and the same POS.
Examples:
man – woman
Hyponyms of two antonyms (nouns) should also be
antonymous pair by pair:
man – woman
actor – actress
SEMANTIC RELATIONS
HYPERNYMY and HYPONYMY: semantic relations
between synsets, which correspond to the notion
of class inclusion: if W1 is a kind of W2, then W2 is
a hypernym of W1 and W1 is a hyponym of W2.
Example:
rose < plant < living organism
Multi-parent relations:
actress < actor
actress < female
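Because hypernymy is transitive, "is a kind of" queries follow the chain of hypernym links upwards, and a word may have more than one parent. A minimal sketch using the examples above; the dictionary and function names are assumptions.

```python
# Direct hypernyms of each word; "actress" has two parents.
HYPERNYMS = {
    "rose": ["plant"],
    "plant": ["living organism"],
    "actress": ["actor", "female"],
}


def all_hypernyms(word):
    """Every direct and inherited hypernym of `word`, nearest first."""
    found = []
    queue = list(HYPERNYMS.get(word, []))
    while queue:
        current = queue.pop(0)
        if current not in found:
            found.append(current)
            queue.extend(HYPERNYMS.get(current, []))
    return found


def is_kind_of(w1, w2):
    """True if w2 is a (possibly indirect) hypernym of w1."""
    return w2 in all_hypernyms(w1)


print(all_hypernyms("rose"))                # → ['plant', 'living organism']
print(is_kind_of("actress", "female"))      # → True
print(is_kind_of("plant", "rose"))          # → False
```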
SEMANTIC RELATIONS
MERONYMY and HOLONYMY
Semantic relations linking synsets
denoting wholes with those denoting
their parts: if W1 has a W2, and W2 is a
part, portion, or member of W1, then W2
is a meronym of W1 and W1 is a
holonym of W2.
APPLICATIONS
options for synonym selection
queries for semantic relations of a word in the
language's lexical system
antonymy, holonymy, etc.
explanatory definition queries
translation equivalents for a lexical item
THE RELATIONS IN BULNET
The large number of relations encoded in BulNet
effectively illustrates the semantic and
derivational richness of Bulgarian.
This offers diverse opportunities for numerous
applications of the multilingual database.
FRAMENET (Fillmore and Baker 2001, 2010)
A lexical database of English that is both human-
and machine-readable.
Based on annotated examples of how words are
used in actual texts.
Tries to capture human insight into how a word
can be used and converts it into semantic
knowledge that is machine-readable.
Available online at:
http://www.icsi.berkeley.edu/~framenet
FRAME SEMANTICS (Fillmore, 1976, 1985)
A semantic frame is a structure used to define the
meaning of a word.
Example frame: Cutting
Frame elements are the separate elements which
make up a frame.
An Agent cuts an Item into Pieces using an Instrument.
Lexical units are the words that evoke a particular
frame.
carve.v, chop.v, cube.v, cut.v, dice.v, fillet.v, mince.v,
pare.v, slice.v
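The Cutting frame above can be written down as a simple record with its frame elements and lexical units. This is a sketch for illustration; the class and function names are assumptions and do not reflect the FrameNet data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    elements: tuple          # frame elements
    lexical_units: frozenset  # words that evoke the frame


CUTTING = Frame(
    name="Cutting",
    elements=("Agent", "Item", "Pieces", "Instrument"),
    lexical_units=frozenset({
        "carve.v", "chop.v", "cube.v", "cut.v", "dice.v",
        "fillet.v", "mince.v", "pare.v", "slice.v",
    }),
)


def evoked_frames(lemma, pos, frames):
    """Names of the frames evoked by the lexical unit lemma.pos."""
    unit = f"{lemma}.{pos}"
    return [frame.name for frame in frames if unit in frame.lexical_units]


print(evoked_frames("slice", "v", [CUTTING]))   # → ['Cutting']
print(evoked_frames("read", "v", [CUTTING]))    # → []
```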
Uses of Digital Language Resources
EDUCATION
Intelligent searches for particular language
phenomena, i.e. search by (combinations of) word
forms, grammatical tags, semantic relations;
Collocations;
Word and phrase frequencies;
Recent changes in the language;
Translation equivalents;
Semantic structure of the words and their use;
etc.
THANK YOU FOR YOUR ATTENTION!
Let’s play now!
References
Davies, Mark. 2010. The Corpus of Contemporary American
English as the first reliable monitor corpus of English.
Literary and Linguistic Computing 25(4): 447-464. First published
online October 27, 2010.
The British National Corpus, version 3 (BNC XML Edition).
2007. Distributed by Oxford University Computing Services on
behalf of the BNC Consortium. URL:
http://www.natcorp.ox.ac.uk/
Reference Guide for the British National Corpus (XML Edition),
edited by Lou Burnard, February 2007. URL:
http://www.natcorp.ox.ac.uk/XMLedition/URG/
Miller, George A. 1995. WordNet: A Lexical Database for English.
Communications of the ACM 38(11): 39-41.
Fellbaum, Christiane (ed.). 1998. WordNet: An Electronic Lexical
Database. Cambridge, MA: MIT Press.
Koeva, S., T. Tinchev and S. Mihov. 2004. Bulgarian Wordnet: structure and
validation. Romanian Journal of Information Science and
Technology 7(1-2): 61-78. ISSN 1453-8245.
Koeva, S. 2008. Derivational and morphosemantic relations in Bulgarian
Wordnet. In Intelligent Information Systems XVI. Warsaw: Academic
Publishing House, 359-389. ISBN 978-93-60434-44-4.
Ruppenhofer, J. et al. 2010. FrameNet II: Extended Theory and
Practice. https://framenet2.icsi.berkeley.edu/docs/r1.5/book.pdf
Fillmore, Charles. Introduction to FrameNet.
https://framenet.icsi.berkeley.edu/fndrupal/sites/default/files/FNintroCJF.ppt
Fillmore, Charles J. 1985. Frames and the Semantics of
Understanding. Quaderni di Semantica 6(2): 222-53.