Natural Language Processing: Word Level Analysis

siddiquitanveer1 · Mar 07, 2025 · 78 slides

About This Presentation

Word Level Processing


Slide Content

Natural Language Processing – Word Level Analysis
L2NLP (Elective)
TJS

Word Level Analysis
How do we analyze natural language text? By breaking it into constituent units: words, phrases, sentences, and paragraphs.
Whether or not a group of words represents a legitimate sentence is decided by the syntax of the language. Before analyzing syntax, we need to understand words, as words are the most fundamental unit (syntactic as well as semantic) of any natural language text.

Topics to be covered
Processing carried out at word level, including:
- methods for characterizing word sequences
- identifying morphological variants
- detecting and correcting misspelled words
- identifying the correct part of speech of a word

Topics to be covered – Regular expressions
Regular expressions are a beautiful means for describing words. We introduce regular expressions and discuss the standard notations for describing text patterns, along with their implementation using finite-state automata. FSAs and their variants, such as finite-state transducers, have found useful applications in speech recognition and synthesis, spell checking, and information extraction. Detecting and correcting errors is the next topic of discussion.

Topics to be covered
Finally, identifying the class to which a word belongs, its basic form, and its meaning.

Regular expressions
A pattern-matching standard for string parsing and replacement.
- e.g. file search patterns: dir *.txt
- can be used to parse dates, URLs, and email addresses
- useful tools for the design of language compilers
In NLP: used for tokenization, describing lexicons, and morphological analysis.
They first appeared in the Unix-based editor 'ed'. Perl was the first language that provided integrated support for REs (it uses slashes around REs).

Regular expressions were first introduced by Kleene (1956). A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings, called the language of the expression. The simplest kind of regular expression contains a single symbol, for example the expression /a/. A regular expression may also specify a sequence of characters, e.g. /supernova/.


Character classes
/[abcd]/ – matches a, b, c, or d
/[0123456789]/ – any single digit
/[abcdefghijklmnopqrstuvwxyz]/ – any lowercase letter
/[5-9]/ – any one of the digits 5, 6, 7, 8, or 9
/[m-p]/ – any one of the letters m, n, o, or p

/[^x]/ – matches any single character except x (the caret at the beginning of a character class negates it)
REs are case sensitive: /sana/ will not match the string 'Sana'.


/[sS]upernova[sS]/
A ? makes the preceding character optional:
/[sS]upernovas?/
The * operator specifies zero or more occurrences of the previous character or regular expression:
/b*/ matches b, bb, or bbbb. Will it match aaa? (Yes: * allows zero occurrences of b, so it matches the empty string at the start.)
/bb*/ matches one or more b's.

/[ab]*/ – zero or more a's or b's. Does it match aa, bb, or abab? (Yes, all of them.)
Kleene + – one or more occurrences of the previous character or expression:
/[0-9]+/ – one or more digits
Wildcard character: the dot (.) or period matches any single character:
/.at/ matches cat, hat, sat, etc.


/.....berry/ – five wildcard characters followed by 'berry' (e.g. blackberry, strawberry)
Disjunction operator – the pipe symbol (|):
/blackberry|blackberries/
/blackberry|ies/ ?? (No: disjunction has the lowest precedence, so this matches 'blackberry' or 'ies'; use /blackberr(y|ies)/ instead.)
/^[A-Za-z0-9_\.-]+/ – matches a positive number of acceptable characters at the start of the string.
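To make these operators concrete, here is a short illustrative sketch in Python, whose re module uses essentially the notation above without the surrounding slashes (the example strings are invented):

```python
import re

# character class with optional initial capital and optional plural "s"
print(re.findall(r"[sS]upernovas?", "A Supernova was seen; two supernovas followed."))
# -> ['Supernova', 'supernovas']

# Kleene star: * allows zero occurrences, so /b*/ matches the empty string
print(re.fullmatch(r"b*", "") is not None)                 # True

# Kleene plus: one or more digits
print(re.findall(r"[0-9]+", "room 42, floor 7"))           # -> ['42', '7']

# wildcard: any single character followed by "at"
print(re.findall(r".at", "the cat in the hat sat"))        # -> ['cat', 'hat', 'sat']

# disjunction needs grouping: blackberry|ies matches 'blackberry' OR 'ies'
print(re.fullmatch(r"blackberr(y|ies)", "blackberries") is not None)  # True

# anchored class: acceptable characters at the start of the string
print(re.match(r"^[A-Za-z0-9_\.-]+", "user.name-42@example.com").group())
# -> 'user.name-42'
```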

Regular language
/a:b/ denotes a regular relation, where a is the upper (lexical) symbol and b is the lower (surface) symbol.

Finite-State Automata
A finite automaton has:
1) A finite set of states, one of which is designated the initial state or start state, and some of which are designated as final states.
2) An alphabet Σ of possible input symbols.
3) A finite set of transitions that tell, for each state and for each symbol of the input alphabet, which state to go to next.

A deterministic finite-state automaton (DFA) is a 5-tuple (Q, Σ, δ, i, F), where Q is a set of states, Σ is an alphabet, i is the initial state, F ⊆ Q is a set of final states, and δ is a transition function mapping Q × Σ to Q. That is, for each state q and symbol a, there is at most one state that can be reached from q by "following" a.

[Figure: a DFA accepting the language of /abb|acb/]
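As an illustration of how such a machine runs, here is a minimal DFA simulator for /abb|acb/; the state names q0–q4 and the transition table are a reconstruction, not taken from the slide's figure:

```python
# Transition table for a DFA accepting exactly "abb" and "acb".
# q0 is the start state, q4 the only final state; a missing entry means reject.
TRANS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",   # branch for "abb"
    ("q1", "c"): "q3",   # branch for "acb"
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}
FINAL = {"q4"}

def dfa_accepts(s: str) -> bool:
    state = "q0"
    for ch in s:
        if (state, ch) not in TRANS:   # no transition defined: dead state
            return False
        state = TRANS[(state, ch)]
    return state in FINAL

for w in ["abb", "acb", "ab", "abbc"]:
    print(w, dfa_accepts(w))           # True, True, False, False
```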


Nondeterministic finite-state automata (NFA) differ from deterministic automata only in that their transition function maps Q × (Σ ∪ {ε}) to the power set of Q, i.e., to a subset of Q. That is, for each state there might be more than one transition on a given symbol, and each one might lead to a different state. DFAs and NFAs are collectively called finite-state automata (FSA).


[Figure: an NFA for the pattern /(a|b)*baa$/]

Morphological Parsing
Deals with word structure and the formation of words from smaller units.
Goal: to find the morphemes from which a given word is built.
Morphemes – the smallest meaning-bearing units in a language.
Example:
egg – a single morpheme
eggs – two morphemes: the morpheme egg and the plural morpheme -s

Morphemes: stems and affixes
Affixes are further divided into prefixes, suffixes, infixes, and circumfixes.
There are three main ways of word formation: inflection, derivation, and compounding.
Morphology gives an understanding of the syntactic and semantic properties of new words.

Morphological Parsing
Parsing: taking an input and producing some sort of structure for it.
What structure? Morphological, syntactic, semantic, or pragmatic.
inflected surface form → [morphological parsing] → parsed form
Parsed form: the canonical form (or lemma) of the word and a set of tags showing its syntactic category and morphological characteristics, e.g. possible POS and/or inflectional properties (gender, number, person, tense, etc.).

Morphological generation is the inverse process. Both analysis and generation rely on the following sources of information:
- a dictionary of the valid lemmas of the language, and
- a set of inflection paradigms.

Morphological Parsing
A morphological parser uses the following information sources (see the sketch below):
1. Lexicon – lists stems and affixes.
2. Morphotactics – describes the way morphemes are arranged, i.e., which morpheme classes may follow which.
3. Orthographic rules – spelling rules that specify the changes that occur when two given morphemes combine.
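A minimal sketch of how a lexicon and morphotactics might be encoded in code; the tiny lexicon and the single "regular noun + optional plural -s" rule are invented for illustration:

```python
# toy lexicon: stems listed with their morphotactic class
LEXICON = {"fox": "reg-noun", "cat": "reg-noun",
           "goose": "irreg-sg-noun", "geese": "irreg-pl-noun"}

def parse_noun(word: str):
    """Morphotactics: reg-noun (+ optional plural -s); irregular forms listed whole."""
    cls = LEXICON.get(word)
    if cls in ("reg-noun", "irreg-sg-noun"):
        return f"{word} +N +SG"
    if cls == "irreg-pl-noun":
        return f"{word} +N +PL"      # a fuller parser would also map geese -> goose
    if word.endswith("s") and LEXICON.get(word[:-1]) == "reg-noun":
        return f"{word[:-1]} +N +PL"
    return None                      # not licensed by lexicon + morphotactics

for w in ["cat", "cats", "geese", "gooses"]:
    print(w, "->", parse_noun(w))    # 'gooses' is correctly rejected
```

Note how the morphotactics rejects 'gooses': the plural -s may attach only to a regular noun stem.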

Why morphological analysis?
[Figure omitted]
Ref: Natural Language Processing: A Paninian Perspective, Bharati et al.

Problems with an exhaustive lexicon
1. Puts a heavy demand on memory.
2. Fails to show the relationship between different word forms that share the same root.
3. In morphologically complex languages like Turkish, the number of possible word forms may be theoretically infinite.

Stemmers – the simplest morphological systems
- collapse morphological variations of a given word (known as word-forms) to one lemma or stem
- do not require a lexicon
- make use of rewrite rules of the form:
  ier → y (e.g. earlier → early)
  ing → ε (e.g. playing → play)

Stemmers
Stemming algorithms work in two steps:
(i) Suffix removal – removes predefined endings from words.
(ii) Recoding – adds predefined endings to the output of the first step, e.g. ational → ate.
Stemmers are not perfect:
'organization' → 'organ', 'university' → 'univers'
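For comparison, NLTK ships a ready-made suffix-stripping stemmer (Porter's); a brief usage sketch, assuming NLTK is installed:

```python
from nltk.stem.porter import PorterStemmer   # pip install nltk

stemmer = PorterStemmer()
for w in ["playing", "relational", "organization", "university"]:
    print(w, "->", stemmer.stem(w))
# shows both the intended collapsing (playing -> play) and the
# over-stemming noted above (e.g. university -> univers)
```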

Two-level morphological model
First proposed by Koskenniemi (1983); can be used for highly inflected languages. A word is represented as a correspondence between its lexical-level form and its surface-level form:
- surface level: the actual spelling of the word
- lexical level: the concatenation of the morphemes constituting it


Books → "book +N +PL"
This model is usually implemented with a kind of finite-state automaton called a finite-state transducer (FST).

FST
A finite-state transducer (FST) is a 6-tuple (Σ1, Σ2, Q, δ, i, F), where Q, i, and F are the same as for a DFA, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a function mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to the power set of Q, i.e., to a subset of Q.

[Figure: an example FST]


Each of these steps can be implemented with the help of a transducer: one that maps the surface form to the intermediate form, and another that maps the intermediate form to the lexical form.
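A rough sketch of the cascade using plain string rewriting in place of real transducers; the ^ (morpheme boundary) and # (word end) notation and the simplified e-insertion rule follow textbook convention and are not taken from the slides' figures:

```python
def surface_to_intermediate(word: str) -> str:
    # first transducer, sketched as a rewrite: undo e-insertion ("foxes" -> "fox^s#")
    if word.endswith("es") and word[-3] in "sxz":
        return word[:-2] + "^s#"
    if word.endswith("s"):
        return word[:-1] + "^s#"
    return word + "#"

def intermediate_to_lexical(form: str) -> str:
    # second transducer: map morphemes to features
    if form.endswith("^s#"):
        return form[:-3] + " +N +PL"
    return form[:-1] + " +N +SG"

for w in ["foxes", "cats", "dog"]:
    inter = surface_to_intermediate(w)
    print(w, "->", inter, "->", intermediate_to_lexical(inter))
# foxes -> fox^s# -> fox +N +PL
```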

[Figure: an FST for mapping English nouns]

Example
[Figure omitted]

[Figure: the transducer for step 2]

[Figure: a transducer mapping nouns to their stem and morphological features]

Spelling Error Detection and Correction
(1) substitution of a single letter
(2) omission of a single letter
(3) insertion of a single letter
(4) transposition of two adjacent letters

Types of Error
- Non-word errors: detected by n-gram analysis and dictionary lookup.
- Real-word errors.

Spelling Correction
- Error detection
- Error correction: suggesting correct words for a misspelled one.
Methods:
1. Isolated-word error detection and correction
2. Context-dependent error detection and correction

Dictionary lookup
Problems:
- requires the existence of a lexicon
- fails with real-word errors
- the bigger the lexicon, the more likely it is that an error goes undetected

Context-dependent error detection and correction methods
- Minimum edit distance
- Similarity key techniques, e.g. SOUNDEX
- N-gram based techniques
  - require a large corpus or dictionary as training data
  - calculate the likelihood of one character following another
- Neural nets
- Rule-based techniques

Minimum Edit Distance
tuto-r
tumour
Cost = 2 (one substitution, one insertion)

tut-o-r
tu-mour
Cost = 3 (one deletion, two insertions)

Dynamic programming algorithm
Edit distance matrix: the (i,j)-th entry is the distance between the first i characters of the source string and the first j characters of the target string.

dist[i,j] = min( dist[i-1,j] + insert_cost,
                 dist[i-1,j-1] + subst_cost(source[i], target[j]),
                 dist[i,j-1] + delete_cost )
Minimum edit distance algorithm
Input: two strings, X and Y
Output: the minimum edit distance between X and Y

m ← length(X)
n ← length(Y)
for i = 0 to m do
    dist[i,0] ← i            // column initialization (0th column)
for j = 0 to n do
    dist[0,j] ← j            // row initialization (0th row)
for i = 1 to m do
    for j = 1 to n do
        dist[i,j] ← min( dist[i-1,j] + insert_cost,
                         dist[i-1,j-1] + subst_cost(X[i], Y[j]),
                         dist[i,j-1] + delete_cost )
    end-for
end-for
return dist[m,n]
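A direct, runnable translation of the pseudocode above, assuming unit costs:

```python
def min_edit_distance(source: str, target: str) -> int:
    m, n = len(source), len(target)
    # dist[i][j] = distance between first i chars of source, first j of target
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                         # column initialization
    for j in range(n + 1):
        dist[0][j] = j                         # row initialization
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,            # insert
                             dist[i - 1][j - 1] + subst,    # substitute (or match)
                             dist[i][j - 1] + 1)            # delete
    return dist[m][n]

print(min_edit_distance("tutor", "tumour"))    # 2, matching the first alignment above
```

A simple spelling corrector can rank dictionary words by this distance to a misspelled input and suggest the closest ones.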

Example
[Figure: a worked edit distance matrix]

Soundex Algorithm
A class of heuristics to expand a query into phonetic equivalents.
Language specific – mainly for names.
E.g., chebyshev → tchebycheff

Soundex
Convert the source string into a 4-character reduced form; do the same with the target string.

Soundex Algorithm
1. Keep the first letter.
2. Code the rest into digits as shown in the table.
3. Ignore adjacent letters with the same Soundex digit (code them once).
4. Eliminate all zeros.
5. Truncate or pad with zeros to one initial letter and three digits.

Soundex phonetic code:
Letters           Code
a e h i o u w y   0
b f p v           1
c g j k q s x z   2
d t               3
l                 4
m n               5
r                 6
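A compact sketch of steps 1–5 using the simplified code table above (note that this table treats h and w like vowels; classic Soundex handles them slightly differently):

```python
CODE = {c: d for letters, d in [("aehiouwy", "0"), ("bfpv", "1"),
                                ("cgjkqsxz", "2"), ("dt", "3"),
                                ("l", "4"), ("mn", "5"), ("r", "6")]
        for c in letters}

def soundex(name: str) -> str:
    digits = [CODE[c] for c in name.lower() if c in CODE]
    # step 3: adjacent letters sharing a Soundex digit are coded once
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # step 4: eliminate zeros (after dropping the first letter's own code)
    tail = [d for d in collapsed[1:] if d != "0"]
    # steps 1 and 5: keep the first letter, pad/truncate to letter + 3 digits
    return (name[0].upper() + "".join(tail) + "000")[:4]

for n in ["Dickson", "Dikson", "Dixon", "Rodgers", "Rogers"]:
    print(n, soundex(n))
# Dickson, Dikson and Dixon all map to D250; Rodgers/Rogers differ (R326 vs R262)
```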

Soundex
The coding scheme is based on observations like:
- vowels are interchangeable
- consonants with similar sounds are put in the same equivalence class
Example: Dickson, Dikson, and Dixon all receive the same code.
Developed by Odell and Russell in 1918 and used in the US census to match American English names.
Soundex fails on two names with different initials (e.g. Karlson, Carlson), and also in other cases (e.g. Rodgers, Rogers).

Words and Word Classes
Words are classified into categories called parts of speech (word classes or lexical categories).

Part of Speech
Tag   Class         Examples
NN    noun          student, chair, proof, mechanism
VB    verb          study, increase, produce
JJ    adjective     large, high, tall, few
RB    adverb        carefully, slowly, uniformly
IN    preposition   in, on, to, of
PRP   pronoun       I, me, they
DT    determiner    the, a, an, this, those

Open vs. closed word classes.

[Figure: the Penn Treebank tagset for verbs]


Part of Speech Tagging
The process of assigning a part of speech (noun, verb, pronoun, preposition, adverb, adjective, etc.) to each word in a sentence.
words + tag set → [POS tagger] → POS-tagged words

Speech/NN sounds/NNS were/VBD sampled/VBN by/IN a/DT microphone/NN.
Another possible tagging for the sentence is:
Speech/NN sounds/VBZ were/VBD sampled/VBN by/IN a/DT microphone/NN.
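For a quick look at what an off-the-shelf tagger produces on this sentence, NLTK's default tagger can be used (assuming NLTK is installed; resource names and the exact output vary across NLTK versions):

```python
import nltk
nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

sent = "Speech sounds were sampled by a microphone."
print(nltk.pos_tag(nltk.word_tokenize(sent)))
# e.g. [('Speech', 'NN'), ('sounds', 'NNS'), ('were', 'VBD'), ...]
```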

Part of Speech Tagging Methods
- Rule-based (linguistic) taggers
- Stochastic (data-driven) taggers
- TBL (Transformation-Based Learning) – hybrid

Rule-based (linguistic) tagging
Steps:
1. Dictionary lookup → potential tags
2. Hand-coded rules
Example: "The show must go on."
Step 1: show → NN, VB
Step 2: discard the incorrect tag, using the rule:
IF the preceding word is a determiner THEN eliminate the VB tag.

Morphological information:
IF a word ends in -ing and the preceding word is a verb THEN label it a verb (VB).
Capitalization information can be used similarly.

Pros (+): speed, deterministic.
Cons (−): requires manual work; the rules are usable for only one language.

Stochastic Tagger
Assumption: the probability of a chain of symbols can be approximated in terms of its parts, or n-grams.
- Simplest: uni-gram (assign each word its most likely tag)
- Bi-gram: uses the current word and the tag of the previous word

She had a fast {NN}. (1)
Muslims fast {VB} during Ramadan. (2)
Those who were injured in the accident need medical help fast. (3)
A uni-gram tagger assigns VB to all occurrences of 'fast' (assuming 'fast' occurs more often as a verb than as any other POS).
A bi-gram tagger: as the tag sequence "DT NN" is more likely than the tag sequence "DT JJ", a bi-gram model will assign the correct tag to the word 'fast' in sentence (1). It is more likely that an adverb (as compared to a noun or adjective) follows a verb; hence, in sentence (3) the tag assigned to 'fast' will be RB.

An n-gram model considers the current word and the tags of the previous n-1 words when assigning a tag to a word.
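NLTK provides trainable uni-gram and bi-gram taggers that implement this idea; a sketch assuming the Penn Treebank sample that ships with NLTK:

```python
import nltk
nltk.download("treebank", quiet=True)
from nltk.corpus import treebank

train = treebank.tagged_sents()[:3000]
unigram = nltk.UnigramTagger(train)                  # most likely tag per word
bigram = nltk.BigramTagger(train, backoff=unigram)   # previous tag as context

print(unigram.tag("She had a fast".split()))
print(bigram.tag("She had a fast".split()))
# the bi-gram tagger can prefer NN after a determiner, as described above
```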
69

HMM Tagger
Assigns a tag sequence to a sentence. A Hidden Markov Model uses lexical and bi-gram probabilities estimated over a tagged training corpus in order to compute the most likely tag sequence for each sentence.

Let W be a sequence of words, W = w1, w2, ..., wn. The task is to find the tag sequence T = t1, t2, ..., tn which maximizes P(T|W), i.e.
T' = argmax_T P(T|W)
Applying Bayes' rule, P(T|W) can be estimated using the expression:
P(T|W) = P(W|T) * P(T) / P(W)

As the probability of the word sequence, P(W), remains the same for each tag sequence, we can drop it. The expression for the most likely tag sequence becomes:
T' = argmax_T P(W|T) * P(T)
By the chain rule:
P(T) = P(t1) * P(t2|t1) * P(t3|t1 t2) * ... * P(tn|t1 t2 ... tn-1)
Using the Markov assumption, for a tri-gram model:
P(T) = P(t1) * P(t2|t1) * P(t3|t1 t2) * ... * P(tn|tn-2 tn-1)

P(W|T) – the probability of seeing a word sequence given a tag sequence.
Apply the following assumption: the words are independent of each other, and the probability of a word depends only on its tag:
P(W|T) = P(w1|t1) * P(w2|t2) * P(w3|t3) * ... * P(wn|tn)
Estimating the probabilities from counts in the training corpus:
P(ti|ti-2 ti-1) = C(ti-2, ti-1, ti) / C(ti-2, ti-1)
P(wi|ti) = C(wi, ti) / C(ti)
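These count-based estimates plug into the Viterbi algorithm, which finds argmax_T P(W|T)P(T) efficiently. A minimal bi-gram sketch with invented toy counts (log probabilities to avoid underflow):

```python
import math
from collections import defaultdict

def train(tagged_sents):
    # relative-frequency estimates: P(w|t) = C(w,t)/C(t), P(t|s) = C(s,t)/C(s)
    emit, trans, count = defaultdict(int), defaultdict(int), defaultdict(int)
    for sent in tagged_sents:
        prev = "<s>"
        count["<s>"] += 1
        for word, tag in sent:
            emit[(word, tag)] += 1
            trans[(prev, tag)] += 1
            count[tag] += 1
            prev = tag
    return emit, trans, count

def logp(num, den):
    return math.log(num / den) if num else float("-inf")

def viterbi(words, tags, emit, trans, count):
    # V[t] = best log-probability of any tag sequence for the prefix ending in t
    V = {t: logp(trans[("<s>", t)], count["<s>"]) +
            logp(emit[(words[0], t)], count[t]) for t in tags}
    back = []
    for w in words[1:]:
        newV, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda s: V[s] + logp(trans[(s, t)], count[s]))
            newV[t] = (V[prev] + logp(trans[(prev, t)], count[prev])
                       + logp(emit[(w, t)], count[t]))
            ptr[t] = prev
        V, back = newV, back + [ptr]
    best = max(tags, key=lambda t: V[t])
    seq = [best]
    for ptr in reversed(back):         # follow back-pointers to recover the path
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))

corpus = [[("she", "PRP"), ("had", "VBD"), ("a", "DT"), ("fast", "NN")],
          [("muslims", "NNS"), ("fast", "VB")]]
emit, trans, count = train(corpus)
print(viterbi("she had a fast".split(),
              ["PRP", "VBD", "DT", "NN", "NNS", "VB"], emit, trans, count))
# -> ['PRP', 'VBD', 'DT', 'NN']
```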

Hybrid Taggers – TBL
[Figure: the TBL learner]

TBL Tagging Algorithm
Input: a tagged corpus and a lexicon (with most-frequent-tag information)
1. Label every word with its most likely tag.
2. Apply every transformation and select the one which most improves the tagging.
3. Re-tag the corpus by applying the rules.
Repeat steps 1-3 until some stopping criterion is met.
Output: a ranked sequence of transformation rules.
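The transformations themselves are simple rewrites over tag sequences; a toy sketch of applying one such rule (the rule and the example sentence are illustrative, not from the slides):

```python
def apply_transformation(tagged, from_tag, to_tag, when_prev_tag):
    """Apply one TBL rule: change from_tag to to_tag if the previous tag matches."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == when_prev_tag:
            out[i] = (word, to_tag)
    return out

# step 1 tagged "race" with its most likely tag NN; the learned rule repairs it
sent = [("to", "TO"), ("race", "NN")]
print(apply_transformation(sent, "NN", "VB", "TO"))
# -> [('to', 'TO'), ('race', 'VB')]
```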

[Figure: example templates and rules learned]

Unknown Words
In handling unknown words, a POS tagger can adopt the following strategies:
- assign all possible tags to the unknown word
- assign the most probable tag to the unknown word
- assume unknown words have the same distribution as 'things seen once' (an estimator of 'things never seen')
- use word features, i.e. how words are spelled (prefixes, suffixes, word length, capitalization), to guess a (set of) word class(es) – the most powerful approach

The most powerful unknown-word detectors use: 32 derivational endings (-ion, etc.), capitalization, and hyphenation. More generally, one should use morphological analysis (and some kind of machine learning approach), as in the sketch below.
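A small sketch of the word-feature strategy (suffixes, capitalization, digits); the particular heuristics and tag guesses are illustrative, not a fixed standard:

```python
import re

def guess_tags(word: str):
    """Guess a set of plausible tags for an unknown word from its spelling alone."""
    if re.fullmatch(r"\d+([.,]\d+)?", word):
        return {"CD"}                          # numbers
    if word[0].isupper():
        return {"NNP"}                         # capitalized: likely proper noun
    if word.endswith(("ion", "ment", "ness", "ity")):
        return {"NN"}                          # derivational endings -> noun
    if word.endswith("ing"):
        return {"VBG", "NN"}
    if word.endswith("ly"):
        return {"RB"}
    if "-" in word:
        return {"JJ", "NN"}                    # hyphenated words are often adjectives
    return {"NN", "VB", "JJ"}                  # fall back to the open classes

for w in ["Tanveer", "modernization", "34.5", "quickly", "zorping"]:
    print(w, guess_tags(w))
```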