Information Retrieval Systems (IRS)

INFORMATION RETRIEVAL SYSTEMS
IV B.TECH - I SEMESTER (JNTUH-R15)
Ms. S. J. Sowjanya, Associate Professor, CSE
Mr. N. V. Krishna Rao, Associate Professor, CSE
Mr. C. Praveen Kumar, Assistant Professor, CSE
COMPUTER SCIENCE AND ENGINEERING
INSTITUTE OF AERONAUTICAL ENGINEERING
(Autonomous)
DUNDIGAL, HYDERABAD - 500 043

UNIT-I

What is IR?
•IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. (system-centered view)
•IR involves helping users find information that matches their information needs. (user-centered view)

IR Systems
•IR systems contain three components:
 –System
 –People
 –Documents (information items)
[Diagram: System - User - Documents]

Data and Information
•Data
 –String of symbols associated with objects, people, and events
 –Values of an attribute
•Data need not have meaning to everyone
•Data must be interpreted with associated attributes

Data and Information
•Information
 –The meaning of the data as interpreted by a person or a system
 –Data that changes the state of a person or system that perceives it
 –Data that reduces uncertainty
  •If data contain no uncertainty, there is no information in the data
  •Examples: "It snows in the winter." vs. "It does not snow this winter."

Information and Knowledge
•Knowledge
 –Structured information
  •through structuring, information becomes understandable
 –Processed information
  •through processing, information becomes meaningful and useful
 –Information shared and agreed upon within a community
[Diagram: Data → Information → Knowledge]

Information Retrieval
•Conceptually, information retrieval is used to cover all related problems in finding needed information
•Historically, information retrieval is about document retrieval, emphasizing the document as the basic unit
•Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

Definition of IRS
•An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information.
 –Information may be text (including numeric and date data), images, video, and other multimedia objects.
•Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.
 –The web is a special case, as we will discuss.

•An IRS consists of software programs that facilitate a user in finding the information the user needs.
 –The system may use standard computer hardware to support the search subfunction and to convert non-textual sources to a searchable medium.
•The success of an IRS is how well it can minimize the overhead for a user to find the needed information.
 –Overhead, from the user's perspective, is the time required to find the information needed, excluding the time spent actually reading the relevant data.
 –Thus, search composition, search execution, and reading non-relevant items are all aspects of IR overhead.

•Goal: to minimize the overhead of a user locating needed information
•Two major measures:
 1.Precision: the ability to retrieve top-ranked documents that are mostly relevant.
 2.Recall: the ability of the search to find all of the relevant items in the corpus.
•When a user decides to issue a search looking for information on a topic, the total database is logically divided into four segments.

•Precision = Number_Retrieved_Relevant / Number_Total_Retrieved and Recall = Number_Retrieved_Relevant / Number_Possible_Relevant, where:
 –Number_Possible_Relevant is the number of relevant items in the database.
 –Number_Total_Retrieved is the total number of items retrieved by the query.
 –Number_Retrieved_Relevant is the number of retrieved items that are relevant to the user's search need.
•Precision measures one aspect of the information retrieval overhead for a user associated with a particular search.
•If a search has a precision of 85%, then 15% of the user effort is overhead spent reviewing non-relevant items.
•Recall is a very useful concept, but its denominator is non-calculable in operational systems.

System evaluation
•Efficiency: time, space
•Effectiveness:
 –How capable is a system of retrieving relevant documents?
 –Is one system better than another?
•Metrics often used (together):
 –Precision = retrieved relevant docs / retrieved docs
 –Recall = retrieved relevant docs / relevant docs
[Diagram: Venn diagram of the relevant set and the retrieved set; their intersection is the retrieved relevant documents]
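
As a concrete illustration (not from the slides), a minimal Python sketch of these two metrics, assuming documents are identified by simple IDs:

def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: iterable of doc ids returned by the system
    relevant:  iterable of doc ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # retrieved relevant docs
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved docs are relevant; 5 docs are relevant overall.
print(precision_recall(["d1", "d2", "d3", "d9"], ["d1", "d2", "d3", "d4", "d5"]))
# -> (0.75, 0.6)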

General form of precision/recall
[Plot: precision (y-axis, 0 to 1.0) versus recall (x-axis, 0 to 1.0)]
-Precision changes with respect to recall (it is not a fixed point)
-Systems cannot be compared at a single precision/recall point
-Average precision (over 11 points of recall: 0.0, 0.1, ..., 1.0)

Some techniques to improve IR effectiveness
•Interaction with the user (relevance feedback)
 -Keywords only cover part of the contents
 -The user can help by indicating relevant/irrelevant documents
•The use of relevance feedback
 –To improve the query expression:
  Qnew = α·Qold + β·Rel_d − γ·Nrel_d
  where Rel_d = centroid of the relevant documents, Nrel_d = centroid of the non-relevant documents, and α, β, γ are tuning weights
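
Below is a hedged sketch of this Rocchio-style update, assuming documents and the query are dense term-weight vectors (NumPy arrays) and that alpha, beta, and gamma are tunable constants not fixed by the slide:

import numpy as np

def rocchio(q_old, relevant_docs, nonrelevant_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of relevant documents and away from
    the centroid of non-relevant ones. All vectors share the same term space."""
    rel_centroid = np.mean(relevant_docs, axis=0) if len(relevant_docs) else 0.0
    nrel_centroid = np.mean(nonrelevant_docs, axis=0) if len(nonrelevant_docs) else 0.0
    q_new = alpha * q_old + beta * rel_centroid - gamma * nrel_centroid
    return np.clip(q_new, 0.0, None)   # negative weights are usually dropped

q = np.array([1.0, 0.0, 0.5])
rel = [np.array([0.9, 0.1, 0.7]), np.array([0.8, 0.0, 0.6])]
print(rocchio(q, rel, []))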

IR on the Web
•No stable document collection (spider, crawler)
•Invalid documents, duplication, etc.
•Huge number of documents (partial collection)
•Multimedia documents
•Great variation of document quality
•Multilingual problem

Final remarks on IR
•IR is related to many areas:
 –NLP, AI, databases, machine learning, user modeling, ...
 –libraries, the Web, multimedia search, ...
•Relatively weak theories
•Very strong tradition of experiments
•Many remaining (and exciting) problems
•Difficult area: intuitive methods do not necessarily improve effectiveness in practice

Why is IR difficult?
•Vocabulary mismatch
 –Synonymy: e.g. car vs. automobile
 –Polysemy: e.g. table
•Queries are ambiguous; they are partial specifications of the user's need
•Content representation may be inadequate and incomplete
•The user is the ultimate judge, but we don't know how the judge judges...
 –The notion of relevance is imprecise, and context- and user-dependent
•But how rewarding it is to gain a 10% improvement!

Outline
•What is the IR problem?
•How to organize an IR system? (Or: the main processes in IR)
•Indexing
•Retrieval

Possible approaches
1.String matching (linear search in documents)
 -Slow
 -Difficult to improve
2.Indexing (*)
 -Fast
 -Flexible to further improvement

Retrieval
•The problems underlying retrieval
 –Retrieval model
  •How is a document represented with the selected keywords?
  •How are document and query representations compared to calculate a score?
 –Implementation

Idea for TF x IDF
•TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj
 –term frequency (the tf factor) provides one measure of how well that term describes the document contents
•IDF: inter-cluster similarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection
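
To make the idea concrete, here is a minimal sketch (not taken from the slides) of tf-idf weighting over a toy corpus; the exact smoothing of the idf factor varies between textbooks, and the plain log(N/df) form below is just one common choice:

import math
from collections import Counter

docs = [
    "information retrieval systems retrieve information",
    "the web is a special case of retrieval",
    "clustering groups similar documents",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)
# document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)                       # raw term frequency
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

print(tfidf(tokenized[0]))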

Vector Model
•Index terms are assigned positive and non-binary weights
•The index terms in the query are also weighted
 dj = (w1,j, w2,j, ..., wt,j)
 q = (w1,q, w2,q, ..., wt,q)
•Term weights are used to compute the degree of similarity between documents and the user query
•Then, retrieved documents are sorted in decreasing order of similarity

Vector Model
•Advantages
 –Its term-weighting scheme improves retrieval performance
 –Its partial matching strategy allows retrieval of documents that approximate the query conditions
 –Its cosine ranking formula sorts the documents according to their degree of similarity to the query
•Disadvantage
 –The assumption of mutual independence between index terms

Vector space model
•Vector space = all the keywords encountered
 <t1, t2, t3, ..., tn>
•Document
 D = <a1, a2, a3, ..., an>, where ai = weight of ti in D
•Query
 Q = <b1, b2, b3, ..., bn>, where bi = weight of ti in Q
•R(D, Q) = Sim(D, Q)
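
The slides do not spell Sim(D, Q) out here; the cosine of the angle between the two weight vectors is the usual choice, sketched below with plain Python lists standing in for the vectors:

import math

def cosine_sim(d, q):
    """Cosine similarity between a document vector d and a query vector q
    (same length, one weight per keyword of the vector space)."""
    dot = sum(a * b for a, b in zip(d, q))
    norm_d = math.sqrt(sum(a * a for a in d))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([0.5, 0.0, 1.2], [1.0, 0.3, 0.7]))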

Probabilistic Model
•Introduced by Robertson and Sparck Jones, 1976
 –Binary independence retrieval (BIR) model
•Idea: given a user query q and the ideal answer set R of the relevant documents, the problem is to specify the properties of this set
 –Assumption (probabilistic principle): the probability of relevance depends on the query and document representations only; the ideal answer set R should maximize the overall probability of relevance
 –The probabilistic model tries to estimate the probability that the user will find document dj relevant via the ratio P(dj relevant to q) / P(dj non-relevant to q)

Probabilistic Model
•Definitions
 –All index term weights are binary, i.e., wi,j ∈ {0, 1}
 –Let R be the set of documents known to be relevant to query q
 –Let R' be the complement of R (the non-relevant documents)
 –Let P(R|dj) be the probability that document dj is relevant to query q, and P(R'|dj) the probability that dj is non-relevant to q

Probabilistic Model
•The similarity sim(dj, q) of the document dj to the query q is defined as the ratio
 sim(dj, q) = P(R | dj) / P(R' | dj)

Probabilistic Model
•P(ki | R) stands for the probability that the index term ki is present in a document randomly selected from the set R
•P(¬ki | R) stands for the probability that the index term ki is not present in a document randomly selected from the set R

UNIT-II

Relevance Feedback
•The query is represented by a vector Q, each document is represented by a vector Di, and a measure of relevance between the query and the document vector is computed as SC(Q, Di), where SC is the similarity coefficient.
•The basic assumption is that the user has issued a query Q and retrieved a set of documents.
•The user is then asked whether or not the documents are relevant.
•After the user responds, the set R contains the n1 relevant document vectors, and the set S contains the n2 non-relevant document vectors.

•The idea is that the relevant documents have terms matching those in the original query.
•The weights corresponding to these terms are increased by adding the relevant document vector. Terms in the query that are in the non-relevant documents have their weights decreased.
•Also, terms that are not in the original query (had an initial component value of zero) are now added to the original query.

•Only the top-ranked non-relevant document is used, instead of the sum of all non-relevant documents.
•An interesting case occurs when the original query retrieves only non-relevant documents.
•By increasing the weight, the term now rings true and yields some relevant documents.
•This is not applicable to automatic feedback, as the top n documents are assumed, by definition, to be relevant.

Why clustering?
•Let's look at the problem from a different angle
 –The issue here is dealing with high-dimensional data
•How do people deal with high-dimensional data?
 –Start by finding interesting patterns associated with the data
 –Clustering is one of the well-known techniques with successful applications on large domains for finding patterns
•But what is clustering?

Introduction
•The goal of clustering is to
 –group data points that are close (or similar) to each other
 –identify such groupings (or clusters) in an unsupervised manner
•Unsupervised: no information is provided to the algorithm on which data points belong to which clusters

Hierarchical clustering
•Modified from Dr. Seungchan Kim's slides
•Given the input set S, the goal is to produce a hierarchy (dendrogram) in which nodes represent subsets of S.
•Features of the tree obtained:
 –The root is the whole input set S.
 –The leaves are the individual elements of S.
 –The internal nodes are defined as the union of their children.
•Each level of the tree represents a partition of the input data into several (nested) clusters or groups.

Hierarchical clustering
[Figure: example dendrogram]

Hierarchical clustering
•There are two styles of hierarchical clustering algorithms to build a tree from the input set S:
 –Agglomerative (bottom-up):
  •Begin with singletons (sets with 1 element)
  •Merge them until S is achieved as the root
  •This is the most common approach (see the sketch below)
 –Divisive (top-down):
  •Recursively partition S until singleton sets are reached
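
As an illustrative sketch (my own example, not from the slides), a naive single-link agglomerative pass over one-dimensional points, merging the two closest clusters until the whole set S remains as the root; real IR systems would use distances between document vectors instead:

def single_link_agglomerative(points):
    """Naive O(n^3) single-link agglomerative clustering over 1-D points.
    Returns the merge history, i.e. the dendrogram built from leaves to root."""
    clusters = [[p] for p in points]              # start from singletons
    history = []
    while len(clusters) > 1:
        best = None
        # find the pair of clusters with the smallest single-link distance
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

for step in single_link_agglomerative([1.0, 1.2, 5.0, 5.1, 9.0]):
    print(step)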

Hierarchical clustering: forming clusters
•Forming clusters from dendrograms
[Figure: cutting a dendrogram at a chosen level yields the clusters]

Hierarchical clustering
•Advantages
 –Dendrograms are great for visualization
 –Provides hierarchical relations between clusters
 –Shown to be able to capture concentric clusters
•Disadvantages
 –Not easy to define levels for clusters
 –Experiments showed that other clustering techniques outperform hierarchical clustering

N-Grams
•N-grams are sequences of tokens.
•The N stands for how many terms are used:
 –Unigram: 1 term
 –Bigram: 2 terms
 –Trigram: 3 terms
•You can use different kinds of tokens:
 –Character-based n-grams
 –Word-based n-grams
 –POS-based n-grams
•N-grams give us some idea of the context around the token we are looking at.

Simple N-Grams
•Assume a language has V word types in its lexicon; how likely is word x to follow word y?
 –Simplest model of word probability: 1/V
 –Alternative 1: estimate the likelihood of x occurring in new text based on its general frequency of occurrence estimated from a corpus (unigram probability)
  •popcorn is more likely to occur than unicorn
 –Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, ...)
  •mythical unicorn is more likely than mythical popcorn

Using N-Grams
•For N-gram models:
 –P(wn-1, wn) = P(wn | wn-1) P(wn-1)
 –By the chain rule we can decompose a joint probability, e.g. P(w1, w2, w3):
  P(w1, w2, ..., wn) = P(w1 | w2, w3, ..., wn) P(w2 | w3, ..., wn) ... P(wn-1 | wn) P(wn)
•For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams:
 P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>)
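
A tiny sketch (my own toy example, not from the slides) of estimating bigram probabilities by maximum likelihood from a small corpus and scoring a word sequence as the product of its bigram probabilities:

from collections import Counter

corpus = ["<start> the mythical unicorn grazed <end>",
          "<start> the theatre sold popcorn <end>"]
tokens = [s.split() for s in corpus]

unigrams = Counter(w for sent in tokens for w in sent)
bigrams = Counter((a, b) for sent in tokens for a, b in zip(sent, sent[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate of P(word | prev); no smoothing in this sketch.
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def sequence_prob(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sequence_prob("<start> the mythical unicorn".split()))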

Applications
•Why do we want to predict a word, given some preceding words?
 –Rank the likelihood of sequences containing various alternative hypotheses, e.g. for automatic speech recognition or OCR
  •Theatre owners say popcorn/unicorn sales have doubled...
 –Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation

Regression Analysis: Introduction
•Basic idea: use data to identify relationships among variables and use these relationships to make predictions.

Linear regression
•Linear dependence: constant rate of increase of one variable with respect to another (as opposed to, e.g., diminishing returns).
•Regression analysis describes the relationship between two (or more) variables.
•Examples:
 –Income and educational level
 –Demand for electricity and the weather
 –Home sales and interest rates
•Our focus:
 –Gain some understanding of the mechanics.
  •the regression line
  •regression error
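
A minimal sketch of fitting the regression line y = a + b*x by ordinary least squares (a generic illustration with made-up numbers, not data from the slides):

def least_squares(xs, ys):
    """Return intercept a and slope b that minimize the squared regression error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

print(least_squares([1, 2, 3, 4], [2.1, 3.9, 6.2, 8.0]))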

UNIT-III

Outline
•Semantic Networks
•Parsing
•Cross-Language Information Retrieval
•Introduction
•Crossing the Language Barrier

What is Cross-Language Information Retrieval?
•Definition: select information in one language based on queries in another.
•Terminology
 –Cross-Language Information Retrieval (ACM SIGIR '96 Workshop on Cross-Linguistic Information Retrieval)
 –Translingual Information Retrieval (Defense Advanced Research Projects Agency - DARPA)

An Architecture of Cross-Language Information Retrieval
[Figure: architecture of a cross-language information retrieval system]

Building Blocks for CLIR
•Information Retrieval
•Artificial Intelligence
•Speech Recognition
•Information Science
•Computational Linguistics

Information Retrieval
•Filtering
•Relevance feedback
•Document representation
•Latent semantic indexing
•Generalized vector space model
•Collection fusion
•Passage retrieval

Information Retrieval
•Similarity thesaurus
•Local context analysis
•Automatic query expansion
•Fuzzy term matching
•Adapting retrieval methods to the collection
•Building cheap test collections
•Evaluation

Artificial Intelligence
•Machine translation
•Machine learning
•Template extraction and matching
•Building large knowledge bases
•Semantic networks

Speech Recognition
•Signal processing
•Pattern matching
•Phone lattice
•Background noise elimination
•Speech segmentation
•Modeling speech prosody
•Building test databases
•Evaluation

Major Problems of CLIR
•Queries and documents are in different languages.
 –translation
•Words in a query may be ambiguous.
 –disambiguation
•Queries are usually short.
 –expansion

Major Problems of CLIR
•Queries may have to be segmented.
 –segmentation
•A document may be in terms of various languages.
 –language identification

Enhancing Traditional Information Retrieval Systems
•Which part(s) should be modified for CLIR?
[Diagram: (1) documents feed the document representation, (3) queries feed the query representation, and (2)/(4) the two representations are compared]

Cross-Language Information Retrieval
•A taxonomy of CLIR approaches:
 –Query translation
  •Controlled vocabulary
  •Free text
   –Knowledge-based: ontology-based, dictionary-based, thesaurus-based
   –Corpus-based: term-aligned, sentence-aligned, document-aligned, or unaligned (parallel or comparable corpora)
   –Hybrid
 –Document translation
  •Text translation
  •Vector translation
 –No translation

Query Translation Based CLIR
[Diagram: an English query is converted by a translation device into a Chinese query, which a monolingual Chinese retrieval system runs to return retrieved Chinese documents]

Hybrid Methods
•Ballesteros & Croft, 1997
[Diagram: the original Spanish TREC queries are human-translated into English (BASE) queries; these are automatically translated back into Spanish with a bilingual dictionary, with query expansion applied before and/or after the dictionary translation, and the resulting Spanish queries are run in INQUERY]

Hybrid Methods
•Performance evaluation
 –Pre-translation expansion: MRD (0.0823) vs. LF (0.1099, +33.5%) vs. LCA10 (0.1139, +38.5%)
 –Post-translation expansion: MRD (0.0823) vs. LF (0.0916, +11.3%) vs. LCA20 (0.1022, +24.1%)
 –Combined pre- and post-translation: MRD (0.0823) vs. LF (0.1242, +51.0%) vs. LCA20 (0.1358, +65.0%)
 –Still 32% below a monolingual baseline

Hybrid Methods
•Davis, 1997 (TREC-5)
[Diagram: an English query is expanded to Spanish equivalents through a bilingual dictionary (filtered by part of speech); a parallel UN English/Spanish corpus and an IR engine are used to reduce the equivalent set, and the reduced Spanish query is run against the TREC Spanish corpus with a monolingual IR engine]

Document Translation
•Translate the documents, not the query
[Diagram: documents pass through MT before document representation; queries go directly to query representation; the two representations are then compared]
•(1) Efficiency problem
•(2) Retrieval effectiveness? (word order, stop words)
•(3) Cross-language mate finding using MT-LSI (Dumais et al., 1997)

Vector Translation
•Translate document vectors
[Diagram: documents → document representation → MT (vector translation) → comparison with the query representation]

CLIR system using query translation
[Figure: a CLIR system based on query translation]

Generating Mixed Ranked Lists of Documents
•Normalizing scales of relevance
 –using aligned documents
 –using ranks
 –interleaving according to given ratios
•Mapping documents into the same space
 –LSI
 –document translations

Types of Tools
•Character Set/Font Handling
•Word Segmentation
•Phrase/Compound Handling
•Terminology Extraction
•Parsers/Linguistic Processors
•Lexicon Acquisition
•MT Systems
•Summarization
•Mark-Up Tools
•Language Identification
•Stemming/Normalization
•Entity Recognition
•Part-of-Speech Taggers
•Indexing Tools
•Text Alignment
•Speech Recognition/OCR
•Visualization

Character Set/Font Handling
•Input and display support
 –Special input modules, e.g. for Asian languages
 –Out-of-the-box support much improved thanks to modern web browsers
•Character set/file format
 –Unicode/UTF-8
 –XML

Language Identification
•Different levels of multilingual data
 –In different sub-collections
 –Within sub-collections
 –Within items
•Different approaches
 –Tri-grams
 –Stopwords
 –Linguistic analysis
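
As an illustration of the tri-gram approach (a generic sketch, not from the slides), language profiles can be compared by the overlap of their most frequent character tri-grams; the tiny training strings here are only placeholders:

from collections import Counter

def trigram_profile(text, top=300):
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {g for g, _ in grams.most_common(top)}

profiles = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog"),
    "de": trigram_profile("der schnelle braune fuchs springt ueber den faulen hund"),
}

def identify(text):
    probe = trigram_profile(text)
    # pick the language whose profile shares the most tri-grams with the probe
    return max(profiles, key=lambda lang: len(profiles[lang] & probe))

print(identify("the dog jumps"))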

Stemming/Normalization
•Reduction of words to their root form
•Important for languages with rich morphology
•Rule-based or dictionary-based
•Case normalization
•Handling of diacritics (French, ...)
•Vowel (re-)substitution (e.g. Semitic languages, ...)

Entity Recognition / Terminology Extraction
•Proper names, locations, ...
 –Critical, since often missing from dictionaries
 –Special problems in languages such as Chinese
•Domain-specific vocabulary, technical terms
 –Critical for effectiveness and accuracy

Phrase/Compound Handling
•Collocations ("Hong Kong")
 –Important for dictionary lookup
 –Improves retrieval accuracy
•Compounds ("Bankangestelltenlohn" - bank employee salary)
 –Big problem in German
 –Infinite number of compounds - a dictionary is no viable solution

Lexicon Acquisition / Text Alignment
•Goal: automatic construction of data structures such as dictionaries and thesauri
 –Work on parallel and comparable corpora
 –Terminology extraction
 –Similarity thesauri
•Prerequisite: training data, usually aligned
 –Document-, sentence-, or word-level alignment

Too Many Factors in CLIR System Evaluation
•translation
•automatic relevance feedback
•term expansion
•disambiguation
•result merging
•test collection
•need to tone it down to see what happened

TREC-6 Cross-Language Track
•In cooperation with the Swiss Federal Institute of Technology (ETH)
•Task summary: retrieval of English, French, and German documents, both in a monolingual and a cross-lingual mode
•Documents
 –SDA (1988-1990): French (250 MB), German (330 MB)
 –Neue Zurcher Zeitung (1994): German (200 MB)
 –AP (1988-1990): English (759 MB)
•13 participating groups

TREC-7 Cross-Language Track
•Task summary: retrieval of English, French, German and Italian documents
•Results to be returned as a single multilingual ranked list
•Addition of Italian SDA (1989-1990), 90 MB
•Addition of a subtask of 31,000 structured German social science documents (GIRT)
•9 participating groups

TREC-8 Cross-Language Track
•Tasks, documents and topic creation similar to TREC-7
•12 participating groups

CLIR in TREC-9
•Documents
 –Hong Kong Commercial Daily, Hong Kong Daily News, Takungpao: all from 1999 and about 260 MB in total
•25 new topics built in English; translations made to Chinese

Cross-Language Evaluation Forum
•A collaboration between the DELOS Network of Excellence for Digital Libraries and the US National Institute of Standards and Technology (NIST)
•Extension of the CLIR track at TREC (1997-1999)

Main Goals
•Promote research in cross-language system development for European languages by providing an appropriate infrastructure for:
 –CLIR system evaluation, testing and tuning
 –Comparison and discussion of results

CLEF 2000 Task Description
•Four evaluation tracks in CLEF 2000
 –multilingual information retrieval
 –bilingual information retrieval
 –monolingual (non-English) information retrieval
 –domain-specific IR

CLEF 2000 Document Collection
•Multilingual comparable corpus
 –English: Los Angeles Times
 –French: Le Monde
 –German: Frankfurter Rundschau + Der Spiegel
 –Italian: La Stampa
•Similar in genre, content, and time period

3M in Digital Libraries/Museums
•Multi-media
 –Selecting suitable media to represent contents
•Multi-linguality
 –Decreasing the language barriers
•Multi-culture
 –Integrating multiple cultures

NPDM Project
•The Palace Museum, Taipei, is one of the famous museums in the world
•NSC supports a pioneer study of a digital museum project, NPDM, starting from 2000
 –Enamels from the Ming and Ch'ing Dynasties
 –Famous Album Leaves of the Sung Dynasty
 –Illustrations in Buddhist Scriptures with Relative Drawings

Design Issues
•Standardization
 –A standard metadata protocol is indispensable for the interchange of resources with other museums.
•Multimedia
 –A suitable presentation scheme is required.
•Internationalization
 –to share the valuable resources of NPDM with users of different languages
 –to utilize knowledge presented in a foreign language

Translingual Issue
•CLIR
 –to allow users to issue queries in one language to access documents in another language
 –here, the query language is English and the document language is Chinese
•Two common approaches
 –Query translation
 –Document translation

Resources in the NPDM Pilot
•An enamel, a calligraphy, a painting, or an illustration
•MICI-DC
 –Metadata Interchange for Chinese Information
 –Fields accessible to users
  •Short descriptions vs. full texts
  •Bilingual versions vs. Chinese only
 –Fields for maintenance only

Search Modes
•Free search
 –users describe their information need using natural language (Chinese or English)
•Specific topic search
 –users fill in specific fields denoting authors, titles, dates, and so on

Example
•Information need
 –Retrieve "Travelers Among Mountains and Streams, Fan K'uan" ("范寬谿山行旅圖")
•Possible queries
 –Author: Fan Kuan; Kuan, Fan
 –Time: Sung Dynasty
 –Title: Mountains and Streams; Travel among mountains; Travel among streams; Mountain and stream painting
 –Free search: landscape painting; travelers, huge mountain; Nature; scenery; Shensi province

[Diagram: CLIR in the NPDM pilot. An English query is handled by query translation and document translation components: English names are mapped to Chinese names through a specific bilingual dictionary and machine transliteration (name search), English titles are mapped to Chinese titles through a generic bilingual dictionary (title search), and query disambiguation produces the final Chinese query, which a Chinese IR system runs against the NPDM collection]

Specific Topic Search
•Proper names are important query terms
 –Creators such as "林逋" (Lin P'u), "李建中" (Li Chien-chung), "歐陽脩" (Ou-yang Hsiu), etc.
 –Emperors such as "康熙" (K'ang-hsi), "乾隆" (Ch'ien-lung), "徽宗" (Hui-tsung), etc.
 –Dynasties such as "宋" (Sung), "明" (Ming), "清" (Ch'ing), etc.

Name Transliteration
•The alphabets of Chinese and English are totally different
•Wade-Giles (WG) and Pinyin are two famous systems for romanizing Chinese in libraries
•Backward transliteration
 –Transliterate target-language terms back to source-language ones
 –Chen, Huang, and Tsai (COLING, 1998)
 –Lin and Chen (ROCLING, 2000)

Name Mapping Table
•Divide a name into a sequence of Chinese characters, and transform each character into phonemes
•Look up a phoneme-to-WG (Pinyin) mapping table, and derive a canonical form for the name

Name Similarity
•Extract the named entity from the query
•Select the most similar named entity from the name mapping table
•Naming sequences/schemes
 –LastName FirstName1, e.g., Chu Hsi (朱熹)
 –FirstName1 LastName, e.g., Hsi Chu (朱熹)
 –LastName FirstName1-FirstName2, e.g., Hsu Tao-ning (許道寧)
 –FirstName1-FirstName2 LastName, e.g., Tao-ning Hsu (許道寧)
 –Any order, e.g., Tao Ning Hsu (許道寧)
 –Any transliteration, e.g., Ju Shi (朱熹)

Title
•"Travelers among Mountains and Streams"
•"travelers", "mountains", and "streams" are basic components
•Users can express their information need through descriptions of a desired art object
•The system will measure the similarity between the art titles (descriptions) and a query

Free Search
•A query is composed of several concepts.
•Concepts are either transliterated or translated.
•The query translation is similar to a small-scale IR system
•Resources
 –Name-mapping table
 –Title-mapping table
 –Specific English-Chinese dictionary
 –Generic English-Chinese dictionary

Algorithm
•(1) For each resource, the Chinese translations whose scores are larger than a specific threshold are selected.
•(2) The Chinese translations identified from different resources are merged and sorted by their scores.
•(3) Consider the Chinese translation with the highest score in the sorted sequence.
 –If the intersection of the corresponding English description and the query is not empty, then select the translation and delete the common English terms between the query and the English description from the query.
 –Otherwise, skip the Chinese translation.

Algorithm
•(4) Repeat step (3) until the query is empty or all the Chinese translations in the sorted sequence have been considered.
•(5) If the query is not empty, the remaining words are looked up in the general dictionary. A Chinese query is composed of all the translated results. (A code sketch of these steps follows below.)
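
A hedged Python sketch of this greedy translation procedure; the resource format, scoring, and threshold below are stand-ins of my own, since the slides do not define them:

def translate_query(query_terms, resources, general_dict, threshold=0.5):
    """Greedy query translation following steps (1)-(5) above.
    resources maps an English description (a tuple of terms) to a (chinese, score) pair."""
    # (1) + (2): collect candidate translations above the threshold, highest score first
    candidates = [(score, set(desc), zh)
                  for desc, (zh, score) in resources.items() if score > threshold]
    candidates.sort(key=lambda c: c[0], reverse=True)
    remaining, chinese_query = set(query_terms), []
    # (3) + (4): repeatedly take the best candidate whose description still overlaps the query
    for score, desc, zh in candidates:
        if not remaining:
            break
        if desc & remaining:
            chinese_query.append(zh)
            remaining -= desc
    # (5): translate any leftover words with the general dictionary
    chinese_query += [general_dict[w] for w in remaining if w in general_dict]
    return chinese_query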

UNIT-IV

Inverted Index
Query Processing
Signature Files
Duplicate Document Detection

Inverted Index
•Inverted indexes were used in both early information retrieval and database management systems in the 1960s.
•Instead of scanning the entire collection, the text is preprocessed and all unique terms are identified.
•This list of unique terms is referred to as the index.
•For each term, a list of the documents that contain the term is also stored. This list is referred to as a posting list.
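
A minimal sketch (not the book's code) of building such an index in Python, storing <docid, tf> pairs in each posting list:

from collections import defaultdict, Counter

def build_inverted_index(docs):
    """docs: dict mapping doc id -> text. Returns term -> list of (doc id, tf)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term].append((doc_id, tf))
    return index

docs = {1: "inverted indexes speed up retrieval",
        2: "the index stores a posting list for each term"}
index = build_inverted_index(docs)
print(index["index"])    # posting list for the term "index"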


•The size of the index is another concern. An index can be nearly equal in size to the original text.
•This means that storage requirements are doubled due to the index.
•The size of posting lists in the inverted index can be approximated by the Zipfian distribution.
•Zipf proposed that the term frequency distribution in a natural language is such that, if all terms were ordered and assigned a rank, the frequency of occurrence of a term is inversely proportional to its rank.

•Using C/r, where r is the rank and C is the value of the constant, an estimate can be made of the number of occurrences of a given term.
•The constant C is domain-specific and equals the number of occurrences of the most frequent term.

Building an Inverted Index
•An inverted index consists of two components: a list of each distinct term, referred to as the index, and a set of lists referred to as posting lists.
•Thus, there is a posting list for each distinct term in the collection. The posting list is a set of tuples of the form <docid, tf>.

•Index file: contains the actual posting list for each distinct term in the collection. A term t that occurs in i different documents will have a posting list with i entries.
•Document file: contains information about each distinct document - document identifier, long document name, date published, etc.
•Weight file: contains the weight for each document. This is the denominator for the cosine coefficient, defined as the cosine of the angle between the query and document vectors.

Compressing an Inverted Index
•A key objective in the development of inverted index files is to develop algorithms that reduce I/O bandwidth and storage overhead.
•The size of the index file determines the storage overhead imposed.
•The two primary areas in which an inverted index might be compressed are the term dictionary and the posting lists.

Fixed Length Index Compression
•This scheme effectively reduces the domain of the identifiers, allowing them to be stored in a more concise format.
•For each value to be compressed, the minimum number of bytes required to store this value is computed.
•For term frequencies, there is no concept of using an offset between successive values, as each frequency is independent of the preceding value.

Variable Length Index Compression
•In this method, the frequency distribution of all of the offsets is obtained through an initial pass over the text.
•A compression scheme is developed based on the frequency distribution, and a second pass uses the new compression scheme.
•This (gamma) code represents an integer x with 2*floor(log2 x) + 1 bits. The first floor(log2 x) bits are the unary representation of floor(log2 x).
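
A small sketch of one common variant of the gamma code described above (my illustration, not the book's): a unary prefix encodes floor(log2 x), and the remaining floor(log2 x) bits carry the offset of x within that range:

import math

def gamma_encode(x):
    """Elias-gamma-style code for a positive integer x, using 2*floor(log2 x) + 1 bits."""
    assert x >= 1
    n = int(math.floor(math.log2(x)))
    unary = "1" * n + "0"                             # unary part encodes n
    offset = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + offset

for x in (1, 2, 5, 9):
    print(x, gamma_encode(x))      # 1 -> "0", 2 -> "100", 5 -> "11001", 9 -> "1110001"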

Varying Compression Based on Posting List Size
•The gamma scheme can be generalized as a coding paradigm based on a vector V of positive integers vi whose sum is at least N. To code an integer x > 1 relative to V, find k.
•Clearly, V can be changed to give different compression characteristics.
•Low values in V optimize compression for low numbers, while higher values in V provide more resilience for high numbers.

Query Processing
Inverted Index Modifications
•An inverted index can be segmented to allow a fast, quick search of a posting list to locate a particular document.
•A suggested improvement is to continue processing all the terms in the query, but only update the weights found in the d documents.
•Also, once the d documents are accessed, there is no need to update the score for every document; it is only necessary to update the score for those documents that have a non-zero score.

Partial Result Set Retrieval
Cutoff Based on Document Frequency
•The simplest measure of term quality is to rely on document frequency.
•Using between twenty-five and seventy-five percent of the query terms, after they were sorted by document frequency, resulted in almost no degradation in precision and recall for the TREC-4 document collection.

Vector Space Simplifications
•The first variation was to replace the document-length normalization that is based on weight with the square root of the number of terms in Di.
•The second variation was to simply remove the document-length normalization.
•The third measure drops the idf. This eliminates one entry in the index for each term.
•The fourth measure drops the tf but retains the idf. This eliminates the need to store the tf in each entry of the posting list.
•The fifth and final method simply counts matches between the query and the terms.

Signature Files
•The use of signature files lies between a sequential scan of the original text and the construction of an inverted index.
•A signature is an encoding of a document. The idea is to encode all documents as relatively small signatures.
•Construction of a signature is often done with different hashing functions.
•One or more hashing functions are applied to each word in the document.


•The hashing function is used to set a bit in the signature.
•To implement document retrieval, a signature is constructed for the query.
•A Boolean signature cannot store proximity information or information about the weight of a term as it appears in a document.
•Signatures are useful if they can fit into memory.
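
A toy sketch of the idea (my own illustration, not from the book): each document is encoded as a fixed-width bit signature by hashing its words, and a query signature that is not fully covered by a document signature rules that document out; covered documents are only candidate matches, since false positives remain possible:

SIG_BITS = 64

def signature(words, num_hashes=3):
    """Superimposed coding: set a few bits per word in a fixed-width signature."""
    sig = 0
    for w in words:
        for k in range(num_hashes):
            sig |= 1 << (hash((w, k)) % SIG_BITS)
    return sig

doc_sig = signature("signature files lie between scanning and inverted indexes".split())
query_sig = signature(["inverted", "indexes"])
# candidate match only if every query bit is also set in the document signature
print(query_sig & doc_sig == query_sig)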

Scanning to Remove False Positives
•Pattern matching algorithms are related to the use of scanning in information retrieval, since they strive to find a pattern in a string of text characters.
•Typically, pattern matching is defined as finding all positions in the input text that contain the start of a given pattern.
•If the pattern is of size p and the text is of size s, the naive nested-loop pattern match requires O(ps) comparisons.

Duplicate Document Detection
•A method to improve both the efficiency and effectiveness of an information retrieval system is to remove duplicates or near duplicates.
•Duplicates can be removed either at the time documents are added to an inverted index or upon retrieving the results of a query.
•The difficulty is that we do not simply wish to remove exact duplicates; we may well be interested in removing near duplicates as well.

Finding Similar Duplicates
•While it is not possible to define precisely at which point a document is no longer a duplicate of another, researchers have examined several metrics for determining the similarity of one document to another.
•The first is resemblance. The resemblance r of document Di and document Dj is defined as the intersection of features over the union of features from the two documents.

Shingles
•The first near-duplicate algorithm we discuss is the use of shingles.
•A shingle is simply a set of contiguous terms in a document. Shingling techniques, such as COPS, KOALA, and DSC, essentially all compare the number of matching shingles.
•This makes sense because super shingles tend to be somewhat large and will, in all likelihood, completely encompass a short document.
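
A small sketch (not the book's code) that computes the resemblance described above as the ratio of shared to total word-level shingles of two documents:

def shingles(text, k=3):
    """Set of k-term contiguous shingles from a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_i, doc_j, k=3):
    si, sj = shingles(doc_i, k), shingles(doc_j, k)
    return len(si & sj) / len(si | sj) if si | sj else 0.0

a = "commercial vehicle sales in italy rose in february from a year earlier"
b = "commercial vehicle sales in italy rose sharply in february this year"
print(resemblance(a, b))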

Duplicate Detection via Similarity
•Another approach is to simply compute the similarity coefficient between two documents. If the document similarity exceeds a threshold, the documents can be deemed duplicates of each other.
•This requires all pairs of documents to be compared, i.e. each document is compared to every other document and a similarity weight is calculated.

I-Match
•I-Match uses a hashing scheme that uses only some terms in a document.
•The decision of which terms to use is key to the success of the algorithm.
•I-Match is a hash of the document that uses collection statistics.
•The overall runtime of the I-Match approach is O(d log d) in the worst case, where all documents are duplicates of each other.

UNIT-V

A Historical Progression
•Combining separate systems
•Queries are parsed and the structured portions are submitted as a query to the DBMS, while the text-search portions of the query are submitted to an information retrieval system.
•The results are combined and presented to the user.

•It does not take long to build this software, and since information retrieval systems and DBMS are readily available, this is often seen as an attractive solution.
•The key advantage of this approach is that the DBMS and the information retrieval system are commercial products that are continuously improved upon by vendors.

Data Integrity
•Data integrity is sacrificed because the DBMS transaction log and the information retrieval transaction log are not coordinated. If a failure occurs in the middle of an update transaction, the DBMS will end in a state where the entire transaction is either completed or entirely undone. It is not possible to complete half of an update.

Portability
•Portability is sacrificed because the query language is not standard. Presently, a standard information retrieval query language does not exist. However, some work is being done to develop standard information retrieval query languages.
•If one existed, it would require many years for widespread commercial acceptance to occur.

Performance
•Run-time performance suffers because of the lack of parallel processing and query optimization. Although most commercial DBMS have parallel implementations, most information retrieval systems do not.
•Query optimization exists in every relational DBMS. The optimizer's goal is to choose the appropriate access path to the data. A rule-based optimizer uses pre-defined rules, while a cost-based optimizer estimates the cost of using different access paths and chooses the cheapest one.

Extensions to SQL
•An information retrieval system typically hides the inverted index as simply an access structure that is used to obtain data.
•By storing the index as a relation, the authors pointed out that users could easily view the contents of the index and make changes if necessary.
•The authors mentioned extensions, such as RELEVANCE(*), that would compute the relevance of a document to a query using some pre-defined relevance function.

User-defined Operators
•User-defined operators allow users to modify SQL by adding their own functions to the DBMS engine.
•The data type of the argument is given as Rectangle; hence, this example uses both a user-defined function and a user-defined data type.
 Ex. 1: SELECT MAX(AREA(Rectangle)) FROM SHAPE
•The following query obtains all documents that contain the terms Term1, Term2, and Term3:
 Ex. 2: SELECT DocId FROM DOC WHERE SEARCH-TERM(Text, Term1, Term2, Term3)

Integrity
•For user-defined operators to be efficient, they must be linked into the same module as the entire DBMS, giving them access to the entire address space of the DBMS.
•Data that reside in memory, or in disk files that are currently open, can be accessed by the user-defined operator.
•It is possible that the user-defined operator could corrupt these data.

Portability
•The operator may appear to exist, but it may perform an entirely different function.
•Without user-defined operators, anyone with an RDBMS may write an application and expect it to run at any site that runs that RDBMS.
•With user-defined operators, this perspective changes, as the application is limited to only those sites with the user-defined operator.

Performance
•Query optimization, by default, does not know much about the specific user-defined operators.
•Optimization is often based on substantial information about the query. A query with an EQUAL operator can be expected to retrieve fewer rows than one with a LESS THAN operator.
•This knowledge assists the optimizer in choosing an access path.

Information Retrieval as a Relational Application
•The following query lists all the identifiers of documents that contain at least one term in QUERY:
 Ex. 5: SELECT DISTINCT(i.DocId) FROM INDEX i, QUERY q WHERE i.term = q.term
•A query to identify documents that contain any of the terms in the query, except those in the STOP_TERM relation, is given below:
 Ex. 6: SELECT DISTINCT(i.DocId) FROM INDEX i, QUERY q, STOP_TERM s WHERE i.term = q.term AND i.term <> s.term

Preprocessing
•A preprocessor that reads the input file and outputs separate flat files is used.
•Each term is read and checked against a list of SGML markers. The main algorithm for the preprocessor simply parses terms and then applies a hash function to hash them into a small hash table.
•If the term has not occurred for this document, a new entry is added to the hash table. Collisions are handled by a singly linked list associated with the hash table.

A Working Example
•The documents contain both structured and unstructured data and are given below.
<DOC>
<DOCNO> WSJ870323-0180 </DOCNO>
<HL> Italy's Commercial Vehicle Sales </HL>
<DD> 03/23/87 </DD>
<DATELINE> TURIN, Italy </DATELINE>
<TEXT>
Commercial-vehicle sales in Italy rose 11.4% in February from a year earlier, to 8,848 units, according to provisional figures from the Italian Association of Auto Makers.
</TEXT>
</DOC>

<DOC>
<DOCNO> WSJ870323-0161 </DOCNO>
<HL> Who's News: Du Pont Co. </HL>
<DD> 03/23/87 </DD>
<DATELINE> Du Pont Company, Wilmington, DE </DATELINE>
<TEXT>

Boolean Retrieval
•For large document collections, Boolean queries are less useful because the result set is unordered, and a query can result in thousands of matches.
•The user is then forced to tune the Boolean conditions and retry the query until the desired result is obtained.
•Relevance ranking avoids this problem by ranking documents based on a measure of relevance between the documents and the query.
•The user then looks at the top-ranked documents and determines whether or not they fill the information need.

Semi-Structured Search using a Relational Schema
•XML-QL, a query language developed at AT&T [Deutsch et al., 1999], was designed to meet the requirements for a full-featured XML query language set out by the W3C.
•The specification describing XPath as it is known today was released in 1999.

Static Relational Schema to Support XML-QL
•This was first proposed in [Florescu and Kossman, 1999] to provide support for XML query processing.
•Later, in the IIT Information Retrieval Laboratory (www.ir.iit.edu), it was shown that a full XML-QL query language could be built using this basic structure.
•This is done by translating semi-structured XML-QL to SQL. The use of a static schema accommodates data of any XML schema without the need for document-type definitions or XML Schemas.

•The hierarchy of XML documents is kept intact, such that any document indexed into the database can be reconstructed using only the information in the tables. The relations used are:
 TAG_NAME (TagId, tag)
 ATTRIBUTE (AttributeId, attribute)
 TAG_PATH (TagId, path)
 DOCUMENT (DocId, fileName)
 INDEX (Id, parent, path, type, tagId, attrId, pos, value)

Distributed Information Retrieval
•A theoretical model of distributed retrieval
•Web search

Introduction
[Figure: distributed document retrieval]

Centralized Information Retrieval System Model
•Formally, an information retrieval system is defined as a triple I = (D, R, δ), where D is a document collection, R is the set of queries, and δ: R → 2^D is a mapping assigning each query to a set of relevant documents.
•Many information retrieval systems rely upon a thesaurus, in which a user query is expanded to include synonyms of the keywords, so as to match synonyms in a document. Hence, a query that contains the term curtain will also include documents containing the term drapery.

•To include the thesaurus in the model, it was proposed in [Turski, 1971] that the triple be expanded to a quadruple:
 I = (T, D, R, δ)
 where T is a set of distinct terms and the relation p ⊆ T × T is such that p(t1, t2) implies that t1 is a synonym of t2.
•Using the synonym relation, it is possible to represent documents as a set of descriptors and a set of ascriptors. Consider a document D1; the set of descriptors d consists of all terms in D1 such that:
 –Each descriptor is unique
 –No descriptor is a synonym of another descriptor

•An ascriptor is defined as a term that is a synonym of a descriptor.
•Each ascriptor must be synonymous with only one descriptor. Hence, the descriptors represent a minimal description of the document.
•In addition to the thesaurus, a generalization relation over the sets of descriptors is defined as γ ⊆ d × d, where γ(t1, t2) implies that t1 is a more general term than t2. Hence, γ(animal, dog) is an example of a valid generalization.

Distributed Information Retrieval System Model
•The centralized information retrieval system can be partitioned into n local information retrieval systems S1, S2, ..., Sn [Mazur, 1984]. Each system Sj is of the form Sj = (Tj, Dj, Rj, δj), where Tj is the thesaurus, Dj is the document collection, Rj is the set of queries, and δj: Rj → 2^Dj maps the queries to documents.
•By taking the union of the local sites, it is possible to define the distributed information retrieval system.

•Consider a set of documents with the following descriptors:
 D1 = (Mary, Harold, Herbert)
 D2 = (Herbert, dog)
 D3 = (people, dog)
 D4 = (Mary, cheshire)
 D5 = (Mary, dog)
 D6 = (Herbert, black-cat, doberman)
 D7 = (Herbert, doberman)

•A query of the most general terms (people, animal) should return all documents 2 through 7 (document 1 contains no animals, and the query is effectively a Boolean AND). However, the hierarchy given above as S1 will only retrieve documents D3 and D5, and S2 will only retrieve documents D6 and D7. Hence, documents D2 and D4 are missing from the final result if the local result sets are simply concatenated.

Web Search
•Evaluation of Web search engines
 –To automatically evaluate Web search engines, a method using online taxonomies that were created as part of the Open Directory Project (ODP) is described in [Beitzel et al., 2003b]. Online directories were used as known relevant items for a query. If a query matches either the title of the stored item or the directory file name containing a known item, then it is considered a match.

High Precision Search
•Traditionally, precision and recall measures are the main evaluation metrics, while response time and space requirements are likely addressed as well. However, in the Web environment, response time is critical. Furthermore, recall estimation is very difficult, and precision is of limited concern since most users never access any links that appear beyond the first answer screen.
•The Web environment provides many new opportunities to revisit old issues, particularly in terms of performance and accuracy optimizations and evaluation measures of search accuracy. In that light, recently, an hourly analysis of a very large, topically categorized ...

PageRank
•PageRank extends the notion of hubs and authorities in the Web graph originally described in [Kleinberg, 1999]. PageRank is at the heart of the popular Web search engine Google. Essentially, the PageRank algorithm uses incoming and outgoing links to adjust the score of a Web page with respect to its popularity, independent of the user's query. Hence, if a traditional retrieval strategy might previously have ranked two documents equal, the PageRank algorithm will boost the similarity measure for a popular document.

•Here, popular is defined as having a number of other Web pages link to the document. This algorithm works well on Web pages, but has no bearing on documents that do not have any hyperlinks. The calculation of PageRank for page A over all pages D1 ... Dn linking to it is defined as follows:
 PageRank(A) = (1 - d) + d * [PageRank(D1)/C(D1) + ... + PageRank(Dn)/C(Dn)]
•where C(Di) is the number of links out from page Di and d is a damping factor.
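
A compact sketch of iterating this formula to convergence over a toy link graph (my illustration; the damping factor d = 0.85 is a commonly used default, not a value stated on the slide):

def pagerank(links, d=0.85, iterations=50):
    """links: dict page -> list of pages it links to. Returns page -> PageRank score."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            incoming = [q for q in links if page in links[q]]
            new_rank[page] = (1 - d) + d * sum(rank[q] / len(links[q]) for q in incoming)
        rank = new_rank
    return rank

print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))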

Fig: Simple PageRank calculation
[Figure: simple PageRank calculation over a small link graph]

Improving Effectiveness of Web Search Engines
•Compression of the inverted index is the same, partial relevance ranking is the same, etc.
•However, there were, and are, some efforts specifically focused on improving the performance of Web-based information retrieval systems.
•In terms of accuracy improvements, it is reasonable to believe that by sending a request to a variety of different search engines and merging the obtained results, one could improve the accuracy of a Web search engine.