Natural Language Processing
- Word Level Analysis
L2NLP (Elective)
TJS
Word Level Analysis
How to analyze NL text?
- By breaking it into constituent units: words, phrases, sentences, paragraphs.
Whether or not a group of words represents a legitimate sentence is decided by the syntax of the language.
Before analyzing syntax, we need to understand words, as words are the most fundamental units (syntactic as well as semantic) of any natural language text.
Topics to be covered
Processing carried out at word level, including:
- methods for characterizing word sequences
- identifying morphological variants
- detecting and correcting misspelled words
- identifying the correct part of speech of a word
Topics to be covered - Regular expressions
Regular expressions are a beautiful means for describing words. We introduce regular expressions and discuss the standard notations for describing text patterns, and their implementation using finite state automata.
FSAs and their variants, such as finite state transducers, have found useful applications in speech recognition and synthesis, spell checking and information extraction.
Detecting and correcting errors is the next topic of discussion.
Topics to be covered
Finally, identifying the class to which a word belongs, its basic form and its meaning.
Regular expressions
A pattern-matching standard for string parsing and replacement.
e.g. file search patterns: dir *.txt
- can be used to parse dates, URLs and email addresses
- useful tools for the design of language compilers
In NLP: used for tokenization, describing lexicons and morphological analysis.
Supported early on in the Unix-based editor 'ed'; Perl was the first language that provided integrated support for REs (uses slashes around REs).
Regular expressions were first introduced by Kleene (1956).
A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings, called the language of the expression.
The simplest kind of regular expression contains a single symbol, for example the expression /a/.
A regular expression may also specify a sequence of characters, e.g. /supernova/.
Character classes
/[abcd]/ - any one of the characters a, b, c or d
/[0123456789]/ - any single digit
/[abcdefghijklmnopqrstuvwxyz]/ - any lowercase letter
/[5-9]/ - any one of the digits 5, 6, 7, 8, or 9
/[m-p]/ - any one of the letters m, n, o, or p
/[^x]/ - matches any single character except x (caret at the beginning)
REs are case sensitive:
/sana/ will not match the string Sana
/[sS]upernova/ - matches supernova or Supernova
A ? makes the preceding character optional:
/[sS]upernovas?/ - matches with or without the final s
The * operator specifies zero or more occurrences of the previous character or regular expression:
/b*/ matches b, bb, or bbbb - and aaa? (Yes: * allows zero b's, so it matches the empty string inside aaa.)
/bb*/ - one or more b's
/[ab]*/ - zero or more a's or b's; matches aa, bb, or abab? (All three.)
Kleene +: one or more occurrences of the previous character or RE
/[0-9]+/ - one or more digits
Wildcard character:
dot (.) or the period - matches any single character
/.at/ - matches cat, hat, bat, and so on
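The patterns above can be tried out directly; a short sketch using Python's re module (the slashes in the slides are notation only, not part of the pattern):

```python
import re

# Character class: any one of the digits 5-9
print(re.findall(r"[5-9]", "a1b6c9"))  # ['6', '9']

# Optional character with ?: matches supernova or supernovas,
# with or without a capital S
pattern = re.compile(r"[sS]upernovas?")
print(bool(pattern.search("Supernovas collide")))  # True

# Kleene star and plus
print(re.fullmatch(r"b*", "") is not None)          # True: * allows zero b's
print(re.fullmatch(r"[0-9]+", "2024") is not None)  # True: one or more digits

# Wildcard: . matches any single character before "at"
print(re.findall(r".at", "the cat sat on a mat"))  # ['cat', 'sat', 'mat']
```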
A deterministic finite-state automaton (DFA) is a 5-tuple (Q, Σ, δ, i, F), where Q is a set of states, Σ is an alphabet, i is the initial state, F ⊆ Q is a set of final states, and δ is a transition function mapping Q × Σ to Q. That is, for each state q and symbol a, there is at most one state that can be reached from q by "following" a.
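A minimal sketch of this definition, using a hypothetical two-state DFA for the language of /bb*/ (one or more b's); the transition function δ is represented as a dictionary keyed on (state, symbol):

```python
# Hypothetical DFA for the language b, bb, bbb, ... (i.e. /bb*/)
# Q = {0, 1}, alphabet = {'b'}, initial state i = 0, F = {1}
delta = {(0, 'b'): 1, (1, 'b'): 1}
initial, final = 0, {1}

def accepts(s):
    """Follow delta from the initial state; reject on a missing transition."""
    state = initial
    for symbol in s:
        if (state, symbol) not in delta:  # at most one successor per (q, a)
            return False
        state = delta[(state, symbol)]
    return state in final

print(accepts("bbb"))  # True
print(accepts(""))     # False: zero b's is not in the language of /bb*/
print(accepts("ba"))   # False: 'a' has no transition
```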
Morphological Parsing
Deals with word structure and the formation of words from smaller units.
Goal:
- to find the morphemes from which a given word is built.
Morphemes:
- the smallest meaning-bearing units in a language
Ex:
Egg - a single morpheme
Eggs - two: the morpheme egg and the morpheme -s
Morphemes are divided into stems and affixes.
Affixes are further divided into prefixes, suffixes, infixes and circumfixes.
There are three main ways of word formation: inflection, derivation and compounding.
Morphology gives an understanding of the syntactic and semantic properties of new words.
Morphological Parsing
Parsing: taking an input and producing some sort of structure for it.
Structure?
- morphological, syntactic, semantic or pragmatic
Morphological parsing maps an inflected surface form to a parsed form.
Parsed form: the canonical form (or lemma) of the word and a set of tags showing its syntactic category and morphological characteristics, e.g. possible POS and/or inflectional properties (gender, number, person, tense, etc.).
Morphological generation is the inverse process.
Both analysis and generation rely on the following sources of information:
- a dictionary of the valid lemmas of the language, and
- a set of inflection paradigms.
Why MA?
Ref. Natural Language Processing: A Paninian Perspective by Bharati et al.
Problems with an exhaustive lexicon
1. Puts a heavy demand on memory
2. Fails to show the relationship among different roots that have the same word forms
3. In morphologically complex languages like Turkish, the number of possible word forms may be theoretically infinite
Stemmers
- the simplest morphological systems
- collapse morphological variations of a given word (known as word-forms) to one lemma or stem
- do not require a lexicon
- make use of rewrite rules of the form:
  ier → y (e.g. earlier → early)
  ing → (removed) (e.g. playing → play)
Stemming algorithms work in two steps:
(i) Suffix removal - removes predefined endings from words.
(ii) Recoding - adds predefined endings to the output of the first step, e.g. ational → ate.
Stemmers are not perfect:
'organization' → 'organ', university → univers
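A toy suffix-removal stemmer in the spirit of the rewrite rules above; the rule list, its ordering and the length guard are illustrative assumptions, not a full stemmer such as Porter's:

```python
# Illustrative rewrite rules of the form suffix -> replacement,
# tried in order (toy rule set, not a real stemmer)
RULES = [
    ("ational", "ate"),  # recoding: relational -> relate
    ("ier", "y"),        # earlier -> early
    ("ing", ""),         # suffix removal: playing -> play
    ("s", ""),           # eggs -> egg
]

def stem(word):
    for suffix, replacement in RULES:
        # Guard: leave very short words (e.g. "is") alone
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + replacement
    return word

for w in ["earlier", "playing", "relational", "eggs"]:
    print(w, "->", stem(w))
```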
Two-level morphological model
First proposed by Koskenniemi (1983); can be used for highly inflected languages.
- a word is represented as a correspondence between its lexical level form and its surface level form
Surface level - the actual spelling of the word
Lexical level - the concatenation of the morphemes constituting it
Books → "book + N + PL"
This model is usually implemented with a kind of finite-state automata called finite-state transducers (FSTs).
A transducer maps nouns to their stems and morphological features.
Spelling Error Detection and Correction
Single-error misspellings fall into four classes:
(1) substitution of a single letter
(2) omission of a single letter
(3) insertion of a single letter
(4) transposition of two adjacent letters
Types of error:
- Non-word errors: detected by n-gram analysis and dictionary lookup
- Real-word errors
Dictionary lookup
Problems:
- requires the existence of a lexicon
- fails with real-word errors, which need context-dependent error detection and correction methods
- the bigger the lexicon, the more likely it is that an error goes undetected
Correction techniques:
- Minimum edit distance
- Similarity key techniques, e.g. SOUNDEX
- N-gram based techniques: require a large corpus or dictionary as training data; calculate the likelihood of one character following another
- Neural nets
- Rule-based techniques
Dynamic programming algorithm
The (i,j)th entry of the edit distance matrix is the distance between the first i characters of the source and the first j characters of the target string:

dist[i,j] = min( dist[i-1,j] + insert_cost,
                 dist[i-1,j-1] + subst_cost(source_i, target_j),
                 dist[i,j-1] + delete_cost )
Minimum edit distance Algorithm
Input: Two strings, X and Y
Output: The minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i,0] ← i    // column initialization (0th column)
end-for
for j = 0 to n do
    dist[0,j] ← j    // row initialization (0th row)
end-for
for i = 1 to m do
    for j = 1 to n do
        dist[i,j] ← min( dist[i-1,j] + insert_cost,
                         dist[i-1,j-1] + subst_cost(X[i], Y[j]),
                         dist[i,j-1] + delete_cost )
    end-for
end-for
return dist[m,n]
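The algorithm transcribes directly into Python. Here all three costs default to 1, with substitution free when the characters match (a common convention; whether the [i-1,j] move is called insertion or deletion depends on the direction chosen, and does not change the result with unit costs):

```python
def min_edit_distance(source, target,
                      insert_cost=1, delete_cost=1, subst_cost=1):
    """dist[i][j] = distance between the first i chars of source
    and the first j chars of target (matching characters cost 0)."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # column initialization
        dist[i][0] = i
    for j in range(n + 1):          # row initialization
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else subst_cost
            dist[i][j] = min(dist[i - 1][j] + delete_cost,
                             dist[i - 1][j - 1] + sub,
                             dist[i][j - 1] + insert_cost)
    return dist[m][n]

print(min_edit_distance("tumour", "tumor"))         # 1 (one deletion)
print(min_edit_distance("intention", "execution"))  # 5
```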
Soundex Algorithm
A class of heuristics for expanding a query into phonetic equivalents.
Language specific - mainly for names.
E.g., chebyshev → tchebycheff
Soundex:
- convert the source string into a 4-character reduced form
- do the same with the target string
Soundex Algorithm
1. Keep the first letter
2. Code the rest into digits as shown in the table
3. Ignore adjacent letters with the same Soundex digit
4. Eliminate all zeros
5. Truncate or pad with zeros to one initial letter and three digits

Soundex phonetic code:
Letters           Code
a e h i o u w y   0
b f p v           1
c g j k q s x z   2
d t               3
l                 4
m n               5
r                 6
Soundex
The coding scheme is based on observations like:
- vowels are interchangeable
- consonants with similar sounds are put in the same equivalence class
Example: Dickson, Dikson, Dixon all receive the same code.
Developed by Odell and Russell in 1918 and used in the US census to match American English names.
Soundex fails on two names with different initials (e.g. Karlson, Carlson), and also in other cases (e.g. Rodgers, Rogers).
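The five steps and the code table can be sketched as follows; note that Soundex variants differ on whether the first letter's own code takes part in collapsing repeats, so this is one plausible reading of the steps, collapsing adjacent repeated digits only within the rest of the name:

```python
# Digit codes from the Soundex phonetic code table
CODES = {}
for letters, digit in [("aehiouwy", "0"), ("bfpv", "1"), ("cgjkqsxz", "2"),
                       ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    name = name.lower()
    first = name[0].upper()                              # step 1: keep first letter
    digits = [CODES[c] for c in name[1:] if c in CODES]  # step 2: code the rest
    collapsed = []
    for d in digits:                                     # step 3: drop adjacent repeats
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    body = "".join(d for d in collapsed if d != "0")     # step 4: eliminate zeros
    return (first + body + "000")[:4]                    # step 5: pad/truncate to 4

print(soundex("Dickson"), soundex("Dixon"), soundex("Dikson"))  # D250 D250 D250
print(soundex("Rodgers"), soundex("Rogers"))  # R326 R262 — different, as noted above
```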
Words and Word Classes
Words are classified into categories called parts of speech (word classes or lexical categories).

Part of Speech
NN   noun         student, chair, proof, mechanism
VB   verb         study, increase, produce
JJ   adjective    large, high, tall, few
RB   adverb       carefully, slowly, uniformly
IN   preposition  in, on, to, of
PRP  pronoun      I, me, they
DT   determiner   the, a, an, this, those

Open vs. closed word classes
Penn Treebank tagset for verbs
Part of Speech tagging
The process of assigning a part of speech (noun, verb, pronoun, preposition, adverb, adjective, etc.) to each word in a sentence.
A POS tagger takes words plus a tag set as input and produces a POS tag for each word.
Part of speech tagging methods
- Rule-based (linguistic) tagger
- Stochastic (data-driven) tagger
- TBL (Transformation Based Learning) - hybrid

Rule-based tagger
Uses morphological information, e.g.:
IF word ends in -ing and the preceding word is a verb THEN label it a verb (VB).
Also uses capitalization information.
+ Speed; deterministic
- Requires manual work; usable for only one language
Stochastic Tagger
Assumption: the probability of a chain of symbols can be approximated in terms of its parts, or n-grams.
- Simplest: uni-gram (most likely tag)
- Bi-gram: uses the current word and the tag of the previous word

She had a fast {NN}. (1)
Muslims fast {VB} during Ramadan. (2)
Those who were injured in the accident need medical help fast. (3)

A uni-gram tagger assigns VB to all occurrences of fast (assuming fast occurs more often as a verb than as any other POS).
A bi-gram tagger: as the tag sequence "DT NN" is more likely than the tag sequence "DT JJ", a bi-gram model will assign the correct tag to the word fast in sentence (1).
It is more likely that an adverb (as compared to a noun or adjective) follows a verb. Hence, in sentence (3) the tag assigned to fast will be RB.
An n-gram model considers the current word and the tags of the previous n-1 words when assigning a tag to a word.
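The uni-gram strategy above can be sketched with a toy tagged corpus; the corpus, its tags and the NN default for unseen words are illustrative assumptions only:

```python
from collections import Counter, defaultdict

# Toy tagged corpus of (word, tag) pairs — illustrative data only
corpus = [("she", "PRP"), ("had", "VB"), ("a", "DT"), ("fast", "NN"),
          ("muslims", "NN"), ("fast", "VB"), ("during", "IN"),
          ("ramadan", "NN"), ("they", "PRP"), ("fast", "VB")]

# Count how often each word carries each tag
counts = defaultdict(Counter)
for word, tag in corpus:
    counts[word][tag] += 1

def unigram_tag(word):
    """Assign the single most frequent tag seen for this word."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"  # crude default for unknown words

print(unigram_tag("fast"))  # VB: seen twice as VB, once as NN
```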
HMM tagger
Assigns a tag sequence to a sentence.
A Hidden Markov Model uses lexical and bi-gram probabilities estimated over a tagged training corpus in order to compute the most likely tag sequence for each sentence.
Let W be a sequence of words:
W = w1, w2, ..., wn
The task is to find the tag sequence
T = t1, t2, ..., tn
which maximizes P(T|W), i.e.
T' = argmax_T P(T|W)
Applying Bayes' rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)
P(W|T) - the probability of seeing a word sequence, given a tag sequence.
Apply the following assumption: the words are independent of each other, and the probability of a word depends only on its tag:
P(W|T) = P(w1|t1) * P(w2|t2) * P(w3|t3) * ... * P(wn|tn)
Estimating probabilities:
P(ti | ti-2, ti-1) = C(ti-2, ti-1, ti) / C(ti-2, ti-1)
P(wi | ti) = C(wi, ti) / C(ti)
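These count-based estimates can be computed directly from a tagged corpus. A sketch with a tiny illustrative corpus, using bigram rather than trigram tag counts for brevity (matching the bi-gram HMM described above):

```python
from collections import Counter

# Toy tagged corpus — illustrative data only
tagged = [("the", "DT"), ("dog", "NN"), ("barks", "VB"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VB")]

tags = [t for _, t in tagged]
tag_count = Counter(tags)                    # C(t_i)
bigram_count = Counter(zip(tags, tags[1:]))  # C(t_{i-1}, t_i)
word_tag_count = Counter(tagged)             # C(w_i, t_i)

def p_tag(t, prev):
    """Transition estimate P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    return bigram_count[(prev, t)] / tag_count[prev]

def p_word(w, t):
    """Lexical estimate P(w_i | t_i) = C(w_i, t_i) / C(t_i)."""
    return word_tag_count[(w, t)] / tag_count[t]

print(p_tag("NN", "DT"))    # 1.0: DT is always followed by NN in this corpus
print(p_word("dog", "NN"))  # 0.5: dog is one of the two NN tokens
```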
Hybrid Taggers - TBL
TBL Learner
TBL Tagging Algorithm
Input: Tagged corpus and lexicon (with most-frequent-tag information)
1. Label every word with its most likely tag
2. Apply every transformation and select the one which most improves the tagging
3. Re-tag the corpus applying the rules
Repeat steps 1-3 until some stopping criterion is met
Output: Ranked sequence of transformation rules

Example templates and rules learned
Unknown words
In handling unknown words, a POS tagger can adopt the following strategies:
- assign all possible tags to the unknown word
- assign the most probable tag to the unknown word
- assume unknown words have the same distribution as 'things seen once' (an estimator of 'things never seen')
- use word features, i.e. see how words are spelled (prefixes, suffixes, word length, capitalization) to guess a (set of) word class(es) - most powerful
The most powerful unknown-word detectors use 32 derivational endings (-ion, etc.), capitalization and hyphenation.
More generally: one should use morphological analysis (and some kind of machine learning approach)!