Information retrieval chapter 2-Text Operations.ppt

SamuelKetema1 300 views 32 slides Apr 09, 2024
Slide 1
Slide 1 of 32
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31
Slide 32
32

About This Presentation

Chapter two for information retriveal course


Slide Content

Chapter 2 : Text Operations
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)

Howisthefrequencyofdifferentwordsdistributed?
Howfastdoesvocabularysizegrowwiththesizeofacorpus?
SuchfactorsaffecttheperformanceofIRsystem&canbeused
toselectsuitabletermweights&otheraspectsofthesystem.
Afewwordsareverycommon.
2mostfrequentwords(e.g.“the”,“of”)canaccountforabout
10%ofwordoccurrences.
Mostwordsareveryrare.
Halfthewordsinacorpusappearonlyonce,called“readonly
once”.
Statistical Properties of Text

Sample Word Frequency Data

Notallwordsinadocumentareequallysignificanttorepresentthe
contents/meaningsofadocument.
Somewordcarrymoremeaningthanothers.
Nounwordsarethemostrepresentativeofadocumentcontent.
Therefore,oneneedstopreprocessthetextofadocumentinacollection
tobeusedasindexterms.
Usingthesetofallwordsinacollectiontoindexdocumentscreatestoo
muchnoisefortheretrievaltask.
Reducenoisemeansreducewordswhichcanbeusedtorefertothe
document.
Preprocessingistheprocessofcontrollingthesizeofthevocabularyor
thenumberofdistinctwordsusedasindexterms.
Preprocessingwillleadtoanimprovementintheinformationretrieval
performance.
However,somesearchenginesontheWebomitpreprocessing.
Everywordinthedocumentisanindexterm.
Text Operations

Textoperationsistheprocessoftexttransformationsinto
logicalrepresentations.
Themainoperationsforselectingindexterms,i.e.tochoose
words/stems(orgroupsofwords)tobeusedasindexingterms
are:
Lexicalanalysis/Tokenizationofthetext:-digits,hyphens,
punctuationsmarks,andthecaseofletters.
Eliminationofstopwords:-filteroutwordswhicharenotuseful
intheretrievalprocess.
Stemmingwords:-removeaffixes(prefixesandsuffixes)
Constructionoftermcategorizationstructuressuchasthesaurus,
tocapturerelationshipforallowingtheexpansionoftheoriginal
querywithrelatedterms.
Text Operations …

TextProcessingSystem:
Inputtext:-fulltext,abstractortitle.
Output:-adocumentrepresentativeadequateforuseinan
automaticretrievalsystem.
Thedocumentrepresentativeconsistsofalistofclassnames,
eachnamerepresentingaclassofwordsoccurringinthetotal
inputtext.
Adocumentwillbeindexedbyanameifoneofitssignificant
wordsoccursasamemberofthatclass.
Generating Document
Representatives
Index
terms
Tokenization
stemming Thesaurus
stop words
documents

Changetextofthedocumentsintowordstobeadoptedasindex
terms.
Objective-identifywordsinthetext:
Digits,hyphens,punctuationmarks,caseofletters.
Numbersarenotgoodindexterms(like1910,1999);but510B.C.
–unique.
Hyphen–breakupthewords(e.g.state-of-the-art=stateofthe
art)-butsomewords,e.g.gilt-edged,B-49-uniquewordswhich
requirehyphens.
Punctuationmarks–removetotallyunlesssignificant,e.g.program
code:x.exeandxexe.
Caseofletters–notimportantandcanconvertalltoupperor
lower.
Lexical Analysis/Tokenization of
Text

Analyzetextintoasequenceofdiscretetokens(words).
Input:“Friends,RomansandCountrymen”
Output:Tokens(aninstanceofasequenceofcharactersthatare
groupedtogetherasausefulsemanticunitforprocessing)
Friends
Romans
and
Countrymen
Eachsuchtokenisnowacandidateforanindexentry,after
furtherprocessing.
Butwhatarevalidtokenstoomit?
Tokenization

Onewordormultiple:Howdoyoudecideitisonetokenor
twoormore?
Hewlett-PackardHewlettandPackardastwotokens?
State-of-the-art:breakuphyphenatedsequence.
SanFrancisco,LosAngeles
AddisAbaba,ArbaMinch
lowercase,lower-case,lowercase?
Database,database,data-base
Numbers:
Dates(3/12/19vs.Mar.12,2019);
Phonenumbers,
IPaddresses(100.2.86.144)
Issues in Tokenization

Howtohandlespecialcasesinvolvingapostrophes,hyphens
etc?C++,C#,URLs,emails,…
Sometimespunctuation(e-mail),numbers(1999),andcase
(Republicanvs.republican)canbeameaningfulpartofatoken.
However,frequentlytheyarenot.
Simplestapproachistoignoreallnumbersandpunctuationand
useonlycase-insensitiveunbrokenstringsofalphabetic
charactersastokens.
Generally,don’tindexnumbersastext,Butoftenveryuseful.
Willoftenindex“meta-data”,includingcreationdate,format,etc.
separately.
Issuesoftokenizationarelanguagespecific.
Requiresthelanguagetobeknown.
Issues in Tokenization

Thecatsleptpeacefullyinthelivingroom.It’saveryoldcat.
Mr.O’Neillthinksthattheboys’storiesaboutChile’scapital
aren’tamusing.
Exercise: Tokenization

Stopwordsareextremelycommonwordsacrossdocumentcollectionsthat
havenodiscriminatorypower.
Theymayoccurin80%ofthedocumentsinacollection.
Theywouldappeartobeoflittlevalueinhelpingselectdocumentsmatchingauser
needandneedstobefilteredoutfrompotentialindexterms.
Examplesofstopwordsarearticles,pronouns,prepositions,conjunctions,
etc.:
Articles(a,an,the);pronouns:(I,he,she,it,their,his)
Someprepositions(on,of,in,about,besides,against,over),
Conjunctions/connectors(and,but,for,nor,or,so,yet),
Verbs(is,are,was,were),
Adverbs(here,there,out,because,soon,after)and
Adjectives(all,any,each,every,few,many,some)canalsobetreatedasstopwords.
Stopwordsarelanguagedependent.
Elimination of STOPWORD

Intuition:
Stopwordshavelittlesemanticcontent;Itistypicaltoremove
suchhigh-frequencywords.
Stopwordstakeup50%ofthetext.Hence,documentsizereduces
by30-50%.
Smallerindicesforinformationretrieval.
Goodcompressiontechniquesforindices:The30mostcommon
wordsaccountfor30%ofthetokensinwrittentext.
Betterapproximationofimportanceforclassification,summarization,
etc.
Stop words

Onemethod:Sortterms(indecreasingorder)bycollectionfrequency
andtakethemostfrequentones.
Problem:Inacollectionaboutinsurancepractices,“insurance”
wouldbeastopword.
Anothermethod:Buildastopwordlistthatcontainsasetofarticles,
pronouns,etc.
Whydoweneedstoplists:Withastoplist,wecancompareand
excludefromindextermsentirelythecommonestwords.
Withtheremovalofstopwords,wecanmeasurebetter
approximationofimportanceforclassification,summarization,etc.
How to determine a list of Stop words?

StopwordeliminationusedtobestandardinolderIRsystems.
Butthetrendisgettingawayfromdoingthis.Mostwebsearch
enginesindexstopwords:
Goodqueryoptimizationtechniquesmeanyoupaylittleatquery
timeforincludingstopwords.
Youneedstopwordsfor:
Phrasequeries:“KingofDenmark”
Varioussongtitles,etc.:“Letitbe”,“Tobeornottobe”
“Relational”queries:“flightstoLondon”
Eliminationofstopwordsmightreducerecall(e.g.“Tobeornot
tobe”–alleliminatedexcept“be”–noorirrelevantretrieval)
Stop words

Stemmingreducestokenstotheir“root”formofwordstorecognize
morphologicalvariation.
Theprocessinvolvesremovalofaffixes(i.e.prefixesandsuffixes)
withtheaimofreducingvariantstothesamestem.
Oftenremovesinflectionalandderivationalmorphologyofaword.
Inflectionalmorphology:varytheformofwordsinordertoexpressgrammatical
features,suchassingular/pluralorpast/presenttense.E.g.Boy→boys,cut→cutting.
Derivationalmorphology:makesnewwordsfromoldones.E.g.creationisformed
fromcreate,buttheyaretwoseparatewords.Andalso,destruction→destroy.
Stemmingislanguagedependent:
Correctstemmingislanguagespecificandcanbecomplex.
Stemming/Morphological Analysis
for example compressed and
compression are both accepted.
for example compress and
compress are both accept

Stemmingistheprocessofreducinginflected(orsometimesderived)
wordstotheirwordstem.
AStem:theportionofawordwhichisleftaftertheremovalofits
affixes(i.e.,prefixesand/orsuffixes).
Example:‘connect’isthestemfor{connected,connecting
connection,connections}
Thus,[automate,automatic,automation]allreduceto
automat
Aclassnameisassignedtoadocumentifandonlyifoneofits
membersoccursasasignificantwordinthetextofthedocument.
Adocumentrepresentativethenbecomesalistofclassnames,
whichareoftenreferredasthedocumentsindexterms/keywords.
Queries:Queriesarehandledinthesameway.
Stemming

Therearebasicallytwowaystoimplementstemming.
Thefirstapproachistocreateabigdictionarythatmapswordsto
theirstems.
Theadvantageofthisapproachisthatitworksperfectly(insofarasthe
stemofawordcanbedefinedperfectly);thedisadvantagesarethespace
requiredbythedictionaryandtheinvestmentrequiredtomaintainthe
dictionaryasnewwordsappear.
Thesecondapproachistouseasetofrulesthatextractstemsfrom
words.
Theadvantagesofthisapproacharethatthecodeistypicallysmall,andit
cangracefullyhandlenewwords;thedisadvantageisthatitoccasionally
makesmistakes.
But,sincestemmingisimperfectlydefined,anyway,occasionalmistakes
aretolerable,andtherule-basedapproachistheonethatisgenerally
chosen.
Ways to implement stemming

Stemmingistheoperationofstrippingthesufficesfromaword,leaving
itsstem.
Google,forinstance,usesstemmingtosearchforwebpages
containingthewordsconnected,connecting,connectionand
connectionswhenusersaskforawebpagethatcontainstheword
connect.
In1979,MartinPorterdevelopedastemmingalgorithmthatusesaset
ofrulestoextractstemsfromwords,andthoughitmakessome
mistakes,mostcommonwordsseemtoworkoutright.
Porterdescribeshisalgorithmandprovidesareference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter Stemmer

ItisthemostcommonalgorithmforstemmingEnglishwordsto
theircommongrammaticalroot.
ItusesasimpleprocedureforremovingknownaffixesinEnglish
withoutusingadictionary.Togetsridofpluralsthefollowing
rulesareused:
SSESSS caressescaress
IESi poniesponi
SSSS caress→caress
S(nil)catscat
EMENT(Deletefinalelementifwhatremainsislongerthan
1character)
 replacementreplac
 cementcement
Porter stemmer

Whilestep1agetsridofplurals,step1bremoves-edor-ing.
–e.g.
;; agreed -> agree
–;; disabled -> disable
;; matting -> mat
;; mating -> mate
;; meeting -> meet
;; milling -> mill
;; messing -> mess
;; meetings -> meet
;; feed -> feed
Porter stemmer

MayproduceunusualstemsthatarenotEnglishwords:
Removing‘UAL’fromFACTUALandEQUAL
Mayconflate(reducetothesametoken)wordsthatareactually
distinct.
“computer”,“computational”,“computation”allreducedtosame
token“comput”
Notrecognizeallmorphologicalderivations.
Stemming: challenges

Inagroupofthreestudytheporter’sstemmingalgorithmin
detailandimplementusingpython.
Assignment

Mostlyfull-textsearchingcannotbeaccurate,sincedifferent
authorsmayselectdifferentwordstorepresentthesame
concept.
Problem:Thesamemeaningcanbeexpressedusingdifferent
termsthataresynonyms.
Synonym:awordorphrasewhichhasthesameornearlythesame
meaningasanotherwordorphraseinthesamelanguage,
Orawordthatsoundsthesameorisspelledthesameasanother
wordbuthasadifferentmeaning,Hononyms.
Homonyms:awordthatsoundsthesameorisspelledthesameas
anotherwordbuthasadifferentmeaning,
Howcanitbeachievedsuchthatforthesamemeaningthe
identicaltermsareusedintheindexandthequery?
Thesauri

Thesaurus:Thevocabularyofacontrolledindexinglanguage,
formallyorganizedsothatapriorirelationshipsbetween
concepts(forexampleas"broader"and“related")aremade
explicit.
Athesauruscontainstermsandrelationshipsbetweenterms.
IRthesaurirelytypicallyupontheuseofsymbolssuchas:
USE/UF(UF=usedfor),BT(BroaderTerm),andRT(Related
term)todemonstrateinter-termrelationships.
e.g.,car=automobile,truck,bus,taxi,motorvehicle
-color=colour,paint
Thesauri

Thesaurustriestocontroltheuseofthevocabularyby
showingasetofrelatedwordstohandlesynonymsand
homonyms.
Theaimofthesaurusistherefore:
Toprovideastandardvocabularyforindexingandsearching.
Thesaurusrewritetoformequivalenceclasses,andweindexsuch
equivalences.
Whenthedocumentcontainsautomobile,indexitundercaras
well(usually,alsovice-versa)
Toassistuserswithlocatingtermsforproperqueryformulation:
Whenthequerycontainsautomobile,lookundercaraswellfor
expandingquery.
Toprovideclassifiedhierarchiesthatallowthebroadeningand
narrowingofthecurrentrequestaccordingtouserneeds.
Aim of Thesaurus

Example:thesaurusbuilttoassistIRforsearchingcarsand
vehicles:
Term:Motorvehicles:
UF:Automobiles(UF-UseFor)
 Cars
 Trucks
BT: Vehicles(BT–BroaderTerm)
RT: RoadEngineering(RT-RelatedTerm)
 RoadTransport
Thesaurus Construction

Example:thesaurusbuilttoassistIRinthefieldsofcomputer
science:
TERM:naturallanguages.
UFnaturallanguageprocessing(UF=usedforNLP)
BTlanguages(BT=broadertermislanguages)
NTlanguages(NT=Narrowerrelatedterm)
TTlanguages(TT=toptermislanguages)
RTartificialintelligence(RT=relatedterm/s)
computationallinguistic
formallanguages
querylanguages
speechrecognition
More Example

Manyoftheabovefeaturesembodytransformationsthatare:
Language-specificand
Often,application-specific.
Theseare“plug-in”adgendatotheindexingprocess.
Bothopensourceandcommercialplug-insareavailablefor
handlingthese.
Language-specificity

Indexlanguageisthelanguageusedtodescribedocuments
andrequests.
Elementsoftheindexlanguageareindextermswhichmaybe
derivedfromthetextofthedocumenttobedescribed,ormay
bearrivedatindependently.
Ifafulltextrepresentationofthetextisadopted,thenallwords
inthetextareusedasindexterms=fulltextindexing.
Otherwise,needtoselectthewordstobeusedasindexterms
forreducingthesizeoftheindexfilewhichisbasictodesignan
efficientsearchingIRsystem.
Index Term Selection

Question & Answer
4/9/2024 31

Thank You !!!
4/9/2024 32
Tags