Chapter 1 - Overview of Information Retrieval.ppt

44 slides, Apr 09, 2024

About This Presentation

Information retrieval slide notes.


Slide Content

Chapter 1: Overview of Information Retrieval
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of CSE
Dr. Mesfin Abebe Haile (2024)

Information Retrieval
Information retrieval (IR) is the process of finding relevant documents that satisfy the information needs of users from large collections of unstructured text.
General goal of information retrieval: to help users find useful information based on their information needs (with minimum effort), despite the increasing complexity of information and the changing needs of users, while providing immediate random access to the document collection.

Document Corpus
Large collections of documents from various sources: news articles, research papers, books, digital libraries, Web pages, etc.
Sample statistics of text collections:
Dialog (http://www.dialog.com/): claims to have more than 20 terabytes of data in more than 600 databases, with over 1 billion unique records.
LEXIS/NEXIS (http://www.lexisnexis.com/): claims 7 terabytes, 1.7 billion documents, 1.5 million subscribers, and 11,400 databases; more than 200,000 searches per day; 9 mainframes, 300 Unix servers, and 200 NT servers.

Document Corpus
TREC (Text REtrieval Conference) collections: an annual information retrieval conference and competition, with a total of about 10 GB of text data for IR evaluation.
Web search engines: Google claims to index over 3 billion pages.

Information Retrieval Systems?
Document (Web page) retrieval in response to a query.
Quite effective (at some things).
Commercially successful (some of them).
But what goes on behind the scenes? How do they work? What happens beyond the Web?
Web search systems: Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma, HotBot, Baidu, ...

Web Search Engines
There are more than 2,000 general web search engines. The big four are Google, Yahoo!, Live Search, and Ask.
Scientific research and selected-journals search engines: Scirus, About.
Metasearch engines: Search.com, Searchhippo, Searchthe.net, Windseek, Web-search, Webcrawler, Mamma, Ixquick, AllPlus, Fazzle, Jux2.
Multimedia search engine: Blinkx.
Visual search engines: Ujiko, WebBrain, RedZee, Kartoo, Mooter.
Audio/sound search engines: Feedster, Findsounds.
Video search engines: YouTube, Trooker.
Medical search engines: SearchMedica, Healia, Omnimedicalsearch.

Web Search Engines
Index/directory engines: Sunsteam, Supercrawler, Thunderstone, Thenet1, Webworldindex, Smartlinks, Whatusee, Re-quest, DMOZ, Searchtheweb.
Others: Lycos, Excite, Altavista, AOL Search, Intute, Accoona, Jayde, Hotbot, InfoMine, Slider, Selectsurf, Questfinder, Kazazz, Answers, Factbites, Alltheweb.
There are also virtual libraries: Pinakes, WWW Virtual Library, Digital-librarian, Librarians' Internet Index.

Structure of an IR System
An information retrieval system serves as a bridge between the world of authors and the world of readers/users.
Writers present a set of ideas in a document using a set of concepts.
Users then ask the IR system for relevant documents that satisfy their information needs.
What is in the black box? The black box is the processing part of the information retrieval system:
User → [Black box] → Documents

Information Retrieval vs. Data Retrieval
An example of a data retrieval system is a relational database.

Aspect                Data Retrieval                                 Info Retrieval
Data organization     Structured (clear semantics: name, age, ...)   Unstructured (no fields other than text)
Query language        Artificial (well defined, e.g. SQL)            Free text ("natural language"), Boolean
Query specification   Complete                                       Incomplete
Items wanted          Exact matching                                 Partial & best matching, relevant
Accuracy              100% (results are always "correct")            < 50%
Error response        Sensitive                                      Insensitive

Typical IR Task
Given:
A corpus of document collections (text, image, video, audio) published by various authors.
A user information need in the form of a query.
An IR system searches for a ranked set of documents that are relevant to satisfying the information need of the user.

Typical IR System Architecture
(Figure: a query string and the document corpus feed the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...)

Web Search System
(Figure: a web spider crawls pages into the document corpus; the query string and the corpus feed the IR system, which returns ranked documents: 1. Page1, 2. Page2, 3. Page3, ...)

Overview of the Retrieval Process
(Figure: high-level overview of the retrieval process; a detailed view follows.)

Issues that arise in IR
Text representation:
What makes a "good" representation? The use of free-text or content-bearing index terms?
How is a representation generated from text?
What are retrievable objects, and how are they organized?
Information needs representation:
What is an appropriate query language?
How can interactive query formulation and refinement be supported?
Comparing representations:
What is a "good" model of retrieval?
How is uncertainty represented?
Evaluating effectiveness of retrieval:
What are good metrics?
What constitutes a good experimental testbed?

Detail View of the Retrieval Process
(Figure: the user's need enters through the user interface; text operations transform both the query and the documents into the same logical view; the formulated query runs against the index file (an inverted file built by indexing over the text database); searching returns retrieved docs, ranking orders them into ranked docs, and user feedback flows back into query formulation.)

Focus in IR System Design
Improving the effectiveness of the system:
Effectiveness is evaluated in terms of precision, recall, ...
Relevant techniques: stemming, stopwords, weighting schemes, matching algorithms.
Improving the efficiency of the system. The concerns here are:
Storage space usage, access time, ...
Relevant techniques: compression, data/file structures, space-time tradeoffs.

Subsystems of an IR System
The two subsystems of an IR system:
Indexing: an offline process of organizing documents using keywords extracted from the collection. Indexing is used to speed up access to desired information from the document collection as per the user's query.
Searching: an online process that scans the document corpus to find relevant documents that match the user's query.

Statistical Properties of Text
How is the frequency of different words distributed?
A few words are very common: the 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
Most words are very rare: half the words in a corpus appear only once (words "read only once").
How fast does vocabulary size grow with the size of a corpus?
Such factors affect the performance of an IR system and can be used to select suitable term weights and other aspects of the system.
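These distributional properties can be observed directly by counting word frequencies. A minimal sketch using Python's standard library (the sample text and the resulting counts are illustrative, not from the slides):

```python
from collections import Counter

text = (
    "the cat sat on the mat and the dog sat on the log "
    "while the cat watched the dog"
)
tokens = text.split()
freq = Counter(tokens)

# A few words dominate: the single most common word covers a large share.
most_common_word, count = freq.most_common(1)[0]
share = count / len(tokens)

# Many words are rare: these appear exactly once ("read only once").
hapaxes = sorted(w for w, c in freq.items() if c == 1)

print(most_common_word, count, round(share, 2))
print(hapaxes)
```

Even in this tiny sample, one word accounts for roughly a third of all occurrences while several words occur only once, mirroring the skew described above.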

Text Operations
Not all words in a document are equally significant for representing its contents/meaning.
Some words carry more meaning than others.
Nouns are the most representative of a document's content.
Therefore, the text of the documents in a collection needs to be preprocessed to produce index terms.
Text operations is the process of transforming text into logical representations; it generates a set of index terms.

Text Operations
Main operations for selecting index terms:
Tokenization: identify the set of words used to describe the content of a text document.
Stopword removal: filter out frequently appearing words.
Stemming: remove prefixes, infixes, and suffixes.
Design term-categorization structures (like a thesaurus) which capture relationships, allowing expansion of the original query with related terms.
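The first three operations can be sketched as a small pipeline. This is a minimal illustration: the stopword list is a toy one, and the suffix stripper merely stands in for a real stemmer such as Porter's:

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

def tokenize(text):
    # Lowercase and split on non-letter characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    # Crude suffix stripping; a stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    return [stem(t) for t in remove_stopwords(tokenize(text))]

print(index_terms("The friends of the Romans are indexing documents"))
```

Each stage narrows the raw text down toward content-bearing index terms: tokenization normalizes the surface form, stopword removal drops high-frequency function words, and stemming conflates morphological variants.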

Indexing Subsystem
(Figure: documents are assigned document identifiers; the text is tokenized into tokens; the stoplist filters them into non-stoplist tokens; stemming and normalization produce stemmed terms; term weighting yields weighted terms, which, together with the document IDs, are written to the index.)

Example: Indexing
Documents to be indexed: "Friends, Romans, countrymen."
Tokenizer → token stream: Friends, Romans, countrymen
Stemmer and normalizer → modified tokens: friend, roman, countryman
Indexer → index file (inverted file), mapping each term to the IDs of the documents that contain it:
friend → 2, 4
roman → 1, 2
countryman → 13, 16
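The indexing steps above can be sketched as a tiny in-memory inverted index. The document IDs and texts here are illustrative, chosen only to echo the slide's example:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().replace(",", " ").split():
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "Romans and countrymen",
    2: "friends of Romans",
    4: "friends forever",
}
index = build_inverted_index(docs)
print(index["friends"])
print(index["romans"])
```

A real indexer would apply the full text-operations pipeline (stopword removal, stemming) before inserting terms, and would store postings on disk rather than in a dictionary.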

Index File
An index file consists of records, called index entries.
Index files are much smaller than the original file: for 1 GB of TREC text collection, the vocabulary has a size of only 5 MB (ref: Baeza-Yates and Ribeiro-Neto, 2005).
This size may be further reduced by linguistic preprocessing (like stemming and other normalization methods).
The usual unit for text indexing is a word.
Index terms are used to look up records in a file.
An index file usually has its index terms in sorted order; the sort order of the terms in the index file provides an order on the physical file.

Building the Index File
An index file of a document collection consists of a list of index terms and, for each term, a link to one or more documents that contain it.
A good index file maps each keyword Ki to the set of documents Di that contain that keyword.
An index file is a list of search terms organized for associative look-up, i.e., to answer the user's queries:
In which documents does a specified search term appear?
Where within each document does each term appear?
For organizing an index file for a collection of documents, various options are available:
Decide what data structure and/or file structure to use: a sequential file, inverted file, suffix array, signature file, etc.?

Searching Subsystem
(Figure: the query is parsed into query tokens; the stop list filters them into non-stoplist tokens; stemming and normalization produce stemmed terms; term weighting yields query terms; a similarity measure compares the query terms against index terms from the index file to produce the relevant document set, which ranking orders into the ranked document set.)

IR Models - Basic Concepts
One central problem regarding IR systems is the issue of predicting which documents are relevant and which are not.
Such a decision usually depends on a ranking algorithm, which attempts to establish a simple ordering of the retrieved documents.
Documents appearing at the top of this ordering are considered more likely to be relevant.
Thus, ranking algorithms are at the core of IR systems.
The IR model determines the predictions of what is relevant and what is not, based on the notion of relevance implemented by the system.

IR Models - Basic Concepts
After preprocessing, N distinct terms remain; these unique terms form the VOCABULARY.
Let k_i be an index term i and d_j be a document j.
Each term i in a document or query j is given a real-valued weight w_ij.
w_ij is the weight associated with the pair (k_i, d_j). If w_ij = 0, the term does not belong to document d_j.
The weight w_ij quantifies the importance of the index term for describing the document's contents.
vec(d_j) = (w_1j, w_2j, ..., w_tj) is the weighted vector associated with the document d_j.

Mapping Documents & Queries
Represent both documents and queries as N-dimensional vectors in a term-document matrix, which shows the occurrence of terms in the document collection/query.
E.g. d_j = (t_1j, t_2j, ..., t_Nj); q_k = (t_1k, t_2k, ..., t_Nk)
An entry in the matrix corresponds to the "weight" of a term in the document; zero means the term doesn't exist in the document.

        T_1    T_2    ...   T_N
D_1     w_11   w_12   ...   w_1N
D_2     w_21   w_22   ...   w_2N
 :       :      :            :
D_M     w_M1   w_M2   ...   w_MN

The document collection is mapped to a term-by-document matrix.
View each document as a vector in a multidimensional space; nearby vectors are related.
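Building such a term-document weight matrix can be sketched as follows, here using raw term counts as the weights w_ij (the three documents are illustrative):

```python
def term_document_matrix(docs):
    """Rows = documents D_1..D_M, columns = vocabulary terms T_1..T_N,
    entries = term counts (a simple choice of weight w_ij)."""
    vocab = sorted({t for doc in docs for t in doc.split()})
    matrix = []
    for doc in docs:
        tokens = doc.split()
        matrix.append([tokens.count(term) for term in vocab])
    return vocab, matrix

docs = ["cat sat mat", "dog sat log", "cat cat dog"]
vocab, M = term_document_matrix(docs)
print(vocab)
print(M)
```

A zero entry means the term does not occur in that document; real systems typically replace raw counts with TF-IDF or similar weights.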

IR Models: Matching Function
IR models measure the similarity between documents and queries.
A matching function is the mechanism used to match a query with a set of documents.
For example, the vector space model considers documents and queries as vectors in term space and measures the similarity of the document to the query.
Techniques for matching include the dot product, cosine similarity, dynamic programming, ...
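Cosine similarity, one of the matching functions named above, can be sketched directly; the query and document vectors below are illustrative count vectors over a shared three-term vocabulary:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|)
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

query = [1, 0, 1]   # illustrative vocabulary: cat, dog, sat
doc_a = [2, 0, 2]   # same direction as the query
doc_b = [0, 3, 0]   # shares no terms with the query

print(cosine_similarity(query, doc_a))  # ~1.0 (identical direction)
print(cosine_similarity(query, doc_b))  # 0.0 (orthogonal)
```

Because cosine similarity depends only on the angle between vectors, doc_a scores maximally despite having different term counts, while doc_b, which shares no terms with the query, scores zero.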

IR Models
A number of major models have been developed to retrieve information:
The Boolean model,
The vector space model,
The probabilistic model, and
Other models.
The Boolean model is often referred to as the "exact match" model; the others are "best match" models.

The Boolean Model: Example
(Venn diagram: eight documents, d1 through d8, distributed among the regions of three overlapping index-term sets k1, k2, and k3.)
Generate the relevant documents retrieved by the Boolean model for the query:
q = k1 AND (k2 OR NOT k3)
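Reading the query as q = k1 AND (k2 OR NOT k3), Boolean retrieval reduces to set operations over posting sets. The term-to-document assignments below are illustrative, not read off the slide's diagram:

```python
# Posting sets: which documents contain each index term (illustrative).
k1 = {1, 2, 3, 5, 6}
k2 = {2, 3, 4, 6, 7}
k3 = {3, 5, 6, 7, 8}
all_docs = {1, 2, 3, 4, 5, 6, 7, 8}

# q = k1 AND (k2 OR NOT k3): exact-match evaluation via set algebra.
result = k1 & (k2 | (all_docs - k3))
print(sorted(result))
```

This illustrates why the Boolean model is called "exact match": a document either satisfies the query expression or it does not; there is no ranking by degree of relevance.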

IR System Evaluation?
Evaluation provides the ability to measure the difference between IR systems:
How well do our search engines work?
Is system A better than B? Under what conditions?
Evaluation drives what to research:
Identify techniques that work and that do not work.
There are many retrieval models/algorithms/systems; which one is the best?
What is the best method for:
Similarity measures (dot product, cosine, ...),
Index term selection (stopword removal, stemming, ...),
Term weighting (TF, TF-IDF, ...)?

Types of Evaluation Strategies
System-centered studies:
Given documents, queries, and relevance judgments,
try several variations of the system, and
measure which system returns the "best" hit list.
User-centered studies:
Given several users and at least two retrieval systems,
have each user try the same task on both systems, and
measure which system best satisfies the users' information needs.

Evaluation Criteria
What are the main measures for evaluating an IR system's performance?
Measure the effectiveness of the system:
How capable is the system of retrieving relevant documents from the collection?
Is one system better than another?
User satisfaction: how "good" are the documents that are returned as a response to a user query?
"Relevance" of results to meet the information needs of users.

Retrieval Scenario
(Figure: six search engines, A through F, each retrieving 13 results for a given query; filled symbols mark relevant documents, open symbols irrelevant ones.)
Which search engine would you prefer? Why?

Measuring Retrieval Effectiveness
Metrics often used to evaluate the effectiveness of the system:
Recall: the percentage of relevant documents retrieved from the database in response to the user's query: A / (A + C).
Precision: the percentage of retrieved documents that are relevant to the query: A / (A + B).

                Relevant   Irrelevant
Retrieved          A           B
Not retrieved      C           D
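Both metrics follow directly from the retrieved and relevant document sets; the sets below are illustrative:

```python
def precision_recall(retrieved, relevant):
    # A = relevant and retrieved; precision = A/(A+B), recall = A/(A+C).
    a = len(retrieved & relevant)
    precision = a / len(retrieved) if retrieved else 0.0
    recall = a / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {1, 2, 3, 4, 5}   # documents the system returned
relevant = {1, 2, 6, 7}       # documents that are actually relevant

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 2/5 = 0.4 precision, 2/4 = 0.5 recall
```

Note the tension the contingency table implies: retrieving more documents tends to raise recall (fewer misses, C) while lowering precision (more false hits, B).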

Query Language
How do users query?
The basic IR approach is keyword-based search:
Queries are combinations of words, and
the document collection is searched for documents that contain these words.
Word queries are intuitive, easy to express, and allow fast ranking.
There are different query languages:
Single-word queries,
Multi-word queries,
Boolean queries, etc.

Problems with Keywords
Keyword search may not retrieve relevant documents that use synonymous terms (words with similar meaning):
"restaurant" vs. "café"
"Ethiopia" vs. "Abyssinia"
"car" vs. "automobile"
"buy" vs. "purchase"
"movie" vs. "film"
Keyword search may retrieve irrelevant documents that contain polysemous terms (terms with multiple meanings):
"apple" (company vs. fruit)
"bit" (unit of data vs. act of biting)
"bat" (baseball vs. mammal)
"bank" (financial institution vs. river bank)

Relevance Feedback
After the initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
Use this feedback information to reformulate the query, and produce new results based on the reformulated query.
This allows a more interactive, multi-pass process.
Relevance feedback can be gathered in two ways:
Users' (explicit) relevance feedback, and
Pseudo-relevance feedback (automated).
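The query-reformulation step is commonly implemented with the Rocchio method (not named on the slide; shown here as one standard way to do it). The new query vector is pushed toward the centroid of documents judged relevant and away from the centroid of those judged irrelevant; alpha, beta, and gamma are the usual tuning weights:

```python
def rocchio(query, relevant_docs, irrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Reformulate a query vector from relevance feedback (Rocchio method)."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(query))]

    rel_c = centroid(relevant_docs)
    irr_c = centroid(irrelevant_docs)
    # Move toward relevant documents, away from irrelevant ones.
    return [alpha * q + beta * r - gamma * s
            for q, r, s in zip(query, rel_c, irr_c)]

query = [1.0, 0.0, 1.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 2.0]]
irrelevant = [[0.0, 2.0, 0.0]]
print(rocchio(query, relevant, irrelevant))
```

For pseudo-relevance feedback, the same formula is applied automatically, treating the top-ranked documents of the initial retrieval as if the user had marked them relevant.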

Users' Relevance Feedback Architecture
(Figure: the query string and document corpus feed the IR system, which returns ranked documents: 1. Doc1, 2. Doc2, 3. Doc3, ...; the user marks each as relevant or irrelevant; query reformulation uses these feedback rankings to produce a revised query, and the IR system returns re-ranked documents: 1. Doc2, 2. Doc1, 3. Doc4, ...)

Challenges for IR Researchers and Practitioners
Technical challenge: what tools should IR systems provide to allow effective and efficient manipulation of information within such diverse media as text, image, video, and audio?
Interaction challenge: what features should IR systems provide to support a wide variety of users in their search for relevant information?
Evaluation challenge: how can we measure the effectiveness of retrieval? Which tools and features are effective and usable, given the increasing diversity of end users and information-seeking situations?

Assignment One
Pick three of the following concepts (not already taken by other students). Review the literature (books, articles, and the Internet) concerning the meaning, function, pros and cons, and applications of each concept.
1. Information retrieval
2. Search engine
3. Data retrieval
4. Cross-language IR
5. Multilingual IR
6. Document image retrieval
7. Indexing
8. Tokenization
9. Stemming
10. Stop words
11. Normalization
12. Thesaurus
13. Searching
14. IR models
15. Term weighting
16. Similarity measurement
17. Retrieval effectiveness
18. Query language
19. Relevance feedback
20. Query expansion

Question & Answer

Thank You !!!