Natural Language Processing
- Word Level Analysis
L2NLP (Elective)
TJS
Word Level Analysis
How to analyze NL text?
- By breaking it into constituent units: words, phrases, sentences, paragraphs.
Whether or not a group of words represents a legitimate sentence is decided by the syntax of the language.
Before analyzing syntax, we need to understand words, as words are the most fundamental units (syntactic as well as semantic) of any natural language text.
Topics to be covered
Processing carried out at word level, including:
- methods for characterizing word sequences
- identifying morphological variants
- detecting and correcting misspelled words
- identifying the correct part of speech of a word
Topics to be covered - Regular expressions
Regular expressions are a beautiful means for describing words. We introduce regular expressions and discuss the standard notations for describing text patterns, and their implementation using finite state automata.
FSAs and their variants, such as finite state transducers, have found useful applications in speech recognition and synthesis, spell checking and information extraction.
Detecting and correcting errors is the next topic of discussion.
Topics to be covered
Finally, identifying the class to which a word belongs, its basic form and its meaning.
Regular expressions
A pattern-matching standard for string parsing and replacement.
e.g. file search patterns: dir *.txt
- can be used to parse dates, URLs and email addresses
- useful tools for the design of language compilers
In NLP: used for tokenization, describing lexicons and morphological analysis.
Supported early on in the Unix-based editor 'ed'; Perl was the first language that provided integrated support for REs (uses slashes around REs).
Regular expressions were first introduced by Kleene (1956).
A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings, called the language of the expression.
The simplest kind of regular expression contains a single symbol, for example the expression /a/.
A regular expression may also specify a sequence of characters, e.g. /supernova/.
Character classes
/[abcd]/ - any one of the characters a, b, c or d
/[0123456789]/ - any single digit
/[abcdefghijklmnopqrstuvwxyz]/ - any lowercase letter
/[5-9]/ - any one of the digits 5, 6, 7, 8, or 9
/[m-p]/ - any one of the letters m, n, o, or p
/[^x]/ - matches any single character except x (caret at the beginning)
REs are case sensitive:
/sana/ will not match the string Sana
/[sS]upernova/ - matches supernova or Supernova
A ? makes the preceding character optional:
/[sS]upernovas?/ - matches with or without the final s
The * operator specifies zero or more occurrences of the previous character or regular expression:
/b*/ matches b, bb, or bbbb - and aaa? (Yes: * allows zero b's, so it matches the empty string inside aaa.)
/bb*/ - one or more b's
/[ab]*/ - zero or more a's or b's; matches aa, bb, or abab? (All three.)
Kleene +: one or more occurrences of the previous character or RE
/[0-9]+/ - one or more digits
Wildcard character:
dot (.) or the period - matches any single character
/.at/ - matches cat, hat, bat, and so on
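The patterns above can be tried out directly; a short sketch using Python's re module (the slashes in the slides are notation only, not part of the pattern):

```python
import re

# Character class: any one of the digits 5-9
print(re.findall(r"[5-9]", "a1b6c9"))  # ['6', '9']

# Optional character with ?: matches supernova or supernovas,
# with or without a capital S
pattern = re.compile(r"[sS]upernovas?")
print(bool(pattern.search("Supernovas collide")))  # True

# Kleene star and plus
print(re.fullmatch(r"b*", "") is not None)          # True: * allows zero b's
print(re.fullmatch(r"[0-9]+", "2024") is not None)  # True: one or more digits

# Wildcard: . matches any single character before "at"
print(re.findall(r".at", "the cat sat on a mat"))  # ['cat', 'sat', 'mat']
```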
A deterministic finite-state automaton (DFA) is a 5-tuple (Q, Σ, δ, i, F), where Q is a set of states, Σ is an alphabet, i is the initial state, F ⊆ Q is a set of final states, and δ is a transition function mapping Q × Σ to Q. That is, for each state q and symbol a, there is at most one state that can be reached from q by "following" a.
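A minimal sketch of this definition, using a hypothetical two-state DFA for the language of /bb*/ (one or more b's); the transition function δ is represented as a dictionary keyed on (state, symbol):

```python
# Hypothetical DFA for the language b, bb, bbb, ... (i.e. /bb*/)
# Q = {0, 1}, alphabet = {'b'}, initial state i = 0, F = {1}
delta = {(0, 'b'): 1, (1, 'b'): 1}
initial, final = 0, {1}

def accepts(s):
    """Follow delta from the initial state; reject on a missing transition."""
    state = initial
    for symbol in s:
        if (state, symbol) not in delta:  # at most one successor per (q, a)
            return False
        state = delta[(state, symbol)]
    return state in final

print(accepts("bbb"))  # True
print(accepts(""))     # False: zero b's is not in the language of /bb*/
print(accepts("ba"))   # False: 'a' has no transition
```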
Morphological Parsing
Deals with word structure and the formation of words from smaller units.
Goal:
- to find the morphemes from which a given word is built.
Morphemes:
- the smallest meaning-bearing units in a language
Ex:
Egg - a single morpheme
Eggs - two: the morpheme egg and the morpheme -s
Morphemes are divided into stems and affixes.
Affixes are further divided into prefixes, suffixes, infixes and circumfixes.
There are three main ways of word formation: inflection, derivation and compounding.
Morphology gives an understanding of the syntactic and semantic properties of new words.
Morphological Parsing
Parsing: taking an input and producing some sort of structure for it.
Structure?
- morphological, syntactic, semantic or pragmatic
Morphological parsing maps an inflected surface form to a parsed form.
Parsed form: the canonical form (or lemma) of the word and a set of tags showing its syntactic category and morphological characteristics, e.g. possible POS and/or inflectional properties (gender, number, person, tense, etc.).
Morphological generation is the inverse process.
Both analysis and generation rely on the following sources of information:
- a dictionary of the valid lemmas of the language, and
- a set of inflection paradigms.
Why MA?
Ref. Natural Language Processing: A Paninian Perspective by Bharati et al.
Problems with an exhaustive lexicon
1. Puts a heavy demand on memory
2. Fails to show the relationship among different roots that have the same word forms
3. In morphologically complex languages like Turkish, the number of possible word forms may be theoretically infinite
Stemmers
- the simplest morphological systems
- collapse morphological variations of a given word (known as word-forms) to one lemma or stem
- do not require a lexicon
- make use of rewrite rules of the form:
  ier → y (e.g. earlier → early)
  ing → (removed) (e.g. playing → play)
Stemming algorithms work in two steps:
(i) Suffix removal - removes predefined endings from words.
(ii) Recoding - adds predefined endings to the output of the first step, e.g. ational → ate.
Stemmers are not perfect:
'organization' → 'organ', university → univers
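A toy suffix-removal stemmer in the spirit of the rewrite rules above; the rule list, its ordering and the length guard are illustrative assumptions, not a full stemmer such as Porter's:

```python
# Illustrative rewrite rules of the form suffix -> replacement,
# tried in order (toy rule set, not a real stemmer)
RULES = [
    ("ational", "ate"),  # recoding: relational -> relate
    ("ier", "y"),        # earlier -> early
    ("ing", ""),         # suffix removal: playing -> play
    ("s", ""),           # eggs -> egg
]

def stem(word):
    for suffix, replacement in RULES:
        # Guard: leave very short words (e.g. "is") alone
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

for w in ["earlier", "playing", "relational", "eggs"]:
    print(w, "->", stem(w))
```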
Two-level morphological model
First proposed by Koskenniemi (1983); can be used for highly inflected languages.
- a word is represented as a correspondence between its lexical level form and its surface level form
Surface level - the actual spelling of the word
Lexical level - the concatenation of the morphemes constituting it
Books → "book + N + PL"
This model is usually implemented with a kind of finite-state automata called finite-state transducers (FSTs).
A transducer maps nouns to their stems and morphological features.
Spelling Error Detection and Correction
Single-error misspellings fall into four classes:
(1) substitution of a single letter
(2) omission of a single letter
(3) insertion of a single letter
(4) transposition of two adjacent letters
Types of error:
- Non-word errors: detected by n-gram analysis and dictionary lookup
- Real-word errors
Dictionary lookup
Problems:
- requires the existence of a lexicon
- fails with real-word errors, which need context-dependent error detection and correction methods
- the bigger the lexicon, the more likely it is that an error goes undetected
Correction techniques:
- Minimum edit distance
- Similarity key techniques, e.g. SOUNDEX
- N-gram based techniques: require a large corpus or dictionary as training data; calculate the likelihood of one character following another
- Neural nets
- Rule-based techniques
Dynamic programming algorithm
The (i,j)th entry of the edit distance matrix is the distance between the first i characters of the source and the first j characters of the target string:

dist[i,j] = min( dist[i-1,j] + insert_cost,
                 dist[i-1,j-1] + subst_cost(source_i, target_j),
                 dist[i,j-1] + delete_cost )
Minimum edit distance Algorithm
Input: Two strings, X and Y
Output: The minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i,0] ← i    // column initialization (0th column)
end-for
for j = 0 to n do
    dist[0,j] ← j    // row initialization (0th row)
end-for
for i = 1 to m do
    for j = 1 to n do
        dist[i,j] ← min( dist[i-1,j] + insert_cost,
                         dist[i-1,j-1] + subst_cost(X[i], Y[j]),
                         dist[i,j-1] + delete_cost )
    end-for
end-for
return dist[m,n]
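The algorithm transcribes directly into Python. Here all three costs default to 1, with substitution free when the characters match (a common convention; whether the [i-1,j] move is called insertion or deletion depends on the direction chosen, and does not change the result with unit costs):

```python
def min_edit_distance(source, target,
                      insert_cost=1, delete_cost=1, subst_cost=1):
    """dist[i][j] = distance between the first i chars of source
    and the first j chars of target (matching characters cost 0)."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # column initialization
        dist[i][0] = i
    for j in range(n + 1):          # row initialization
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else subst_cost
            dist[i][j] = min(dist[i - 1][j] + delete_cost,
                             dist[i - 1][j - 1] + sub,
                             dist[i][j - 1] + insert_cost)
    return dist[m][n]

print(min_edit_distance("tumour", "tumor"))         # 1 (one deletion)
print(min_edit_distance("intention", "execution"))  # 5
```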
Soundex Algorithm
A class of heuristics for expanding a query into phonetic equivalents.
Language specific - mainly for names.
E.g., chebyshev → tchebycheff
Soundex:
- convert the source string into a 4-character reduced form
- do the same with the target string
Soundex Algorithm
1. Keep the first letter
2. Code the rest into digits as shown in the table
3. Ignore adjacent letters with the same Soundex digit
4. Eliminate all zeros
5. Truncate or pad with zeros to one initial letter and three digits

Soundex phonetic code:
Letters           Code
a e h i o u w y   0
b f p v           1
c g j k q s x z   2
d t               3
l                 4
m n               5
r                 6
Soundex
The coding scheme is based on observations like:
- vowels are interchangeable
- consonants with similar sounds are put in the same equivalence class
Example: Dickson, Dikson, Dixon all receive the same code.
Developed by Odell and Russell in 1918 and used in the US census to match American English names.
Soundex fails on two names with different initials (e.g. Karlson, Carlson), and also in other cases (e.g. Rodgers, Rogers).
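The five steps and the code table can be sketched as follows; note that Soundex variants differ on whether the first letter's own code takes part in collapsing repeats, so this is one plausible reading of the steps, collapsing adjacent repeated digits only within the rest of the name:

```python
# Digit codes from the Soundex phonetic code table
CODES = {}
for letters, digit in [("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    name = name.lower()
    first = name[0].upper()                              # step 1: keep first letter
    digits = [CODES[c] for c in name[1:] if c in CODES]  # step 2: code the rest
    collapsed = []
    for d in digits:                                     # step 3: drop adjacent repeats
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    body = "".join(d for d in collapsed if d != "0")     # step 4: eliminate zeros
    return (first + body + "000")[:4]                    # step 5: pad/truncate to 4

print(soundex("Dickson"), soundex("Dixon"), soundex("Dikson"))  # D250 D250 D250
print(soundex("Rodgers"), soundex("Rogers"))  # R326 R262 — different, as noted above
```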
Words and Word Classes
Words are classified into categories called parts of speech (word classes or lexical categories).

Part of Speech
NN   noun         student, chair, proof, mechanism
VB   verb         study, increase, produce
JJ   adjective    large, high, tall, few
RB   adverb       carefully, slowly, uniformly
IN   preposition  in, on, to, of
PRP  pronoun      I, me, they
DT   determiner   the, a, an, this, those

Open vs. closed word classes
Penn Treebank tagset for verbs
Part of Speech tagging
The process of assigning a part of speech (noun, verb, pronoun, preposition, adverb, adjective, etc.) to each word in a sentence.
A POS tagger takes words plus a tag set as input and produces a POS tag for each word.
Part of speech tagging methods
- Rule-based (linguistic) tagger
- Stochastic (data-driven) tagger
- TBL (Transformation Based Learning) - hybrid

Rule-based tagger
Uses morphological information, e.g.:
IF word ends in -ing and the preceding word is a verb THEN label it a verb (VB).
Also uses capitalization information.
+ Speed; deterministic
- Requires manual work; usable for only one language
Stochastic Tagger
Assumption: the probability of a chain of symbols can be approximated in terms of its parts, or n-grams.
- Simplest: uni-gram (most likely tag)
- Bi-gram: uses the current word and the tag of the previous word

She had a fast {NN}. (1)
Muslims fast {VB} during Ramadan. (2)
Those who were injured in the accident need medical help fast. (3)

A uni-gram tagger assigns VB to all occurrences of fast (assuming fast occurs more often as a verb than as any other POS).
A bi-gram tagger: as the tag sequence "DT NN" is more likely than the tag sequence "DT JJ", a bi-gram model will assign the correct tag to the word fast in sentence (1).
It is more likely that an adverb (as compared to a noun or adjective) follows a verb. Hence, in sentence (3) the tag assigned to fast will be RB.
An n-gram model considers the current word and the tags of the previous n-1 words when assigning a tag to a word.
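The uni-gram strategy above can be sketched with a toy tagged corpus; the corpus, its tags and the NN default for unseen words are illustrative assumptions only:

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs — illustrative data only
corpus = [("she", "PRP"), ("had", "VB"), ("a", "DT"), ("fast", "NN"),
          ("muslims", "NN"), ("fast", "VB"), ("during", "IN"),
          ("ramadan", "NN"), ("they", "PRP"), ("fast", "VB")]

# Count how often each word carries each tag
counts = defaultdict(Counter)
for word, tag in corpus:
    counts[word][tag] += 1

def unigram_tag(word):
    """Assign the single most frequent tag seen for this word."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"  # crude default for unknown words

print(unigram_tag("fast"))  # VB: seen twice as VB, once as NN
```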
HMM tagger
Assigns a tag sequence to a sentence.
A Hidden Markov Model uses lexical and bi-gram probabilities estimated over a tagged training corpus in order to compute the most likely tag sequence for each sentence.
Let W be a sequence of words:
W = w1, w2, ..., wn
The task is to find the tag sequence
T = t1, t2, ..., tn
which maximizes P(T|W), i.e.
T' = argmax_T P(T|W)
Applying Bayes' rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)
P(W|T) - the probability of seeing a word sequence, given a tag sequence.
Apply the following assumption: the words are independent of each other, and the probability of a word depends only on its tag:
P(W|T) = P(w1|t1) * P(w2|t2) * P(w3|t3) * ... * P(wn|tn)
Estimating probabilities:
P(ti | ti-2, ti-1) = C(ti-2, ti-1, ti) / C(ti-2, ti-1)
P(wi | ti) = C(wi, ti) / C(ti)
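These count-based estimates can be computed directly from a tagged corpus. A sketch with a tiny illustrative corpus, using bigram rather than trigram tag counts for brevity (matching the bi-gram HMM described above):

```python
from collections import Counter

# Toy tagged corpus — illustrative data only
tagged = [("the", "DT"), ("dog", "NN"), ("barks", "VB"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VB")]

tags = [t for _, t in tagged]
tag_count = Counter(tags)                    # C(t_i)
bigram_count = Counter(zip(tags, tags[1:]))  # C(t_{i-1}, t_i)
word_tag_count = Counter(tagged)             # C(w_i, t_i)

def p_tag(t, prev):
    """Transition estimate P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    return bigram_count[(prev, t)] / tag_count[prev]

def p_word(w, t):
    """Lexical estimate P(w_i | t_i) = C(w_i, t_i) / C(t_i)."""
    return word_tag_count[(w, t)] / tag_count[t]

print(p_tag("NN", "DT"))    # 1.0: DT is always followed by NN in this corpus
print(p_word("dog", "NN"))  # 0.5: dog is one of the two NN tokens
```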
Hybrid Taggers - TBL
TBL Learner
TBL Tagging Algorithm
Input: Tagged corpus and lexicon (with most-frequent-tag information)
1. Label every word with its most likely tag
2. Apply every transformation and select the one which most improves the tagging
3. Re-tag the corpus applying the rules
Repeat steps 1-3 until some stopping criterion is met
Output: Ranked sequence of transformation rules

Example templates and rules learned
Unknown words
In handling unknown words, a POS tagger can adopt the following strategies:
- assign all possible tags to the unknown word
- assign the most probable tag to the unknown word
- assume unknown words have the same distribution as 'things seen once' (an estimator of 'things never seen')
- use word features, i.e. see how words are spelled (prefixes, suffixes, word length, capitalization) to guess a (set of) word class(es) - most powerful
The most powerful unknown-word detectors use 32 derivational endings (-ion, etc.), capitalization and hyphenation.
More generally: one should use morphological analysis (and some kind of machine learning approach)!