ICDIM 06 Web IR Tutorial


About This Presentation

This tutorial offers a quick introduction to basic concepts of IR.


Slide Content

Web Information Retrieval
Tanveer J Siddiqui
J K Institute of Applied Physics & Technology
University of Allahabad

Example of a query - Engine 1 (screenshot)

Topics
- Algorithmic issues in classic information retrieval (IR), e.g. indexing, stemming, etc.
- Issues related to Web IR

Information Retrieval
Information retrieval (IR) deals with the organization, storage, retrieval and evaluation of information relevant to a user's query.
A user having an information need formulates a request in the form of a query written in natural language. The retrieval system responds by retrieving documents that seem relevant to the query.

Information Retrieval
Traditionally it has been accepted that an information retrieval system does not return the actual information but the documents containing that information. As Lancaster pointed out:
'An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of her inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to her request.'

A question answering system provides the user with answers to specific questions.
Data retrieval systems retrieve precise data.
An information retrieval system does not search for specific data as in a data retrieval system, nor for direct answers to a question as in a question answering system.

Information Retrieval Process (diagram)

Classic IR
Input: Document collection
Goal: Retrieve documents or text with information content that is relevant to the user's information need
Two aspects:
1. Processing the collection
2. Processing queries (searching)

Information Retrieval Model
An IR model is a pattern that defines several aspects of the retrieval procedure, for example:
how the documents and user's queries are represented,
how the system retrieves relevant documents according to users' queries, and
how retrieved documents are ranked.

IR Model
An IR model consists of:
- a model for documents
- a model for queries and
- a matching function which compares queries to documents.

Classical IR Model
IR models can be classified as:
Classical models of IR
Non-classical models of IR
Alternative models of IR

Classical IR Model
based on mathematical knowledge that was easily recognized and well understood
simple, efficient and easy to implement
The three classical information retrieval models are:
- Boolean
- Vector and
- Probabilistic models

Non-Classical models of IR
Non-classical information retrieval models are based on principles other than similarity, probability, Boolean operations etc., on which classical retrieval models are based.
Examples: information logic model, situation theory model and interaction model.

Alternative IR models
Alternative models are enhancements of
classical models making use of specific
techniques from other fields.
Example:
Cluster model, fuzzy model and latent
semantic indexing (LSI) models.

Information Retrieval Model
The actual text of the document and query is not used in the retrieval process; instead, some representation of it is used.
The document representation is matched with the query representation to perform retrieval.
One frequently used method is to represent a document as a set of index terms or keywords.

Indexing
The process of transforming document text to some representation of it is known as indexing.
Different index structures might be used. One data structure commonly used by IR systems is the inverted index.

Inverted index
An inverted index is simply a list of keywords (tokens/terms), with each keyword having pointers to the documents containing that keyword.
- A document is tokenized. A nonempty sequence of characters not including spaces and punctuation can be regarded as a token.

Inverted index
- Each distinct token in the collection may be represented as a distinct integer
- A document is thus transformed into a sequence of integers

Example: Inverted index
D1: The weather is cool.
D2: She has a cool mind.
The inverted index can be represented as a table called POSTING:
tid (term)     did   pos
1 (The)         1     1
2 (Weather)     1     2
3 (is)          1     3
4 (cool)        1     4
5 (She)         2     1
6 (has)         2     2
7 (a)           2     3
4 (cool)        2     4
8 (mind)        2     5

The d1
weather d1
Example : Inverted index
The d1/1
weather d1/2
is d1/3
20
weather d1
is d1
cool d1,d2
She d2
has d2
a d2
mind d2
is d1/3
cool d1/4,d2/4
She d2/1
has d2/2
a d2/3
mind d2/5

The computational cost involved in adopting a full text logical view (i.e. using the full set of words to represent a document) is high.
Hence, some text operations are usually performed to reduce the set of representative keywords.

The two most commonly used text operations are:
1. Stop word elimination and
2. Stemming
Zipf's law

Stop word elimination involves removal of grammatical or function words, while stemming reduces distinct words to their common grammatical root.

Indexing
Most of the indexing techniques involve identifying good document descriptors, such as keywords or terms, to describe the information content of the documents.
A good descriptor is one that helps in describing the content of the document and in discriminating the document from other documents in the collection.

A term can be a single word or a multi-word phrase.
Example:
"Design Features of Information Retrieval systems"
can be represented by the set of terms:
Design, Features, Information, Retrieval, systems
or by the set of terms:
Design, Features, Information Retrieval, Information Retrieval systems

Luhn's early Assumption
Luhn assumed that the frequency of word occurrence in an article gives a meaningful identification of its content.
The discrimination power of index terms is a function of the rank order of their frequency of occurrence.

Stop Word Elimination
Stop words are high frequency words which have little semantic weight and are thus unlikely to help with retrieval.
Such words are commonly used in documents regardless of topic, and have no topical specificity.

Example:
articles ("a", "an", "the") and prepositions (e.g. "in", "of", "for", "at" etc.).

Stop Word Elimination
Advantage
Eliminating these words can result in a considerable reduction in the number of index terms without losing any significant information.
Disadvantage
It can sometimes result in the elimination of terms useful for searching, for instance the stop word A in "Vitamin A".
Some phrases like "to be or not to be" consist entirely of stop words.

Stop Words
About, above, accordingly, afterwards, again, against, alone, along, already, am, among, amongst, and, another, any, anyone, anything, anywhere, around, as, aside, awfully, be, because

Stemming
Stemming normalizes morphological variants.
It removes suffixes from words to reduce them to some root form, e.g. the words compute, computing, computes and computer will all be reduced to the same word stem comput.

Stemming
Porter Stemmer (1980).
Example:
The stemmed representation of
"Design Features of Information Retrieval systems"
will be
{design, featur, inform, retriev, system}
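A small illustrative sketch combining stop word elimination and Porter stemming. It assumes NLTK's PorterStemmer implementation and a tiny hand-picked stop word list, both of which are this text's own choices and not part of the original slides.

```python
# Requires the external NLTK package (pip install nltk); PorterStemmer ships with it.
from nltk.stem import PorterStemmer

# A tiny illustrative stop-word list; real systems use much longer lists
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "at", "is"}

def index_terms(text):
    stemmer = PorterStemmer()
    tokens = [t.lower() for t in text.split()]
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(index_terms("Design Features of Information Retrieval systems"))
# Expected (with the Porter algorithm): ['design', 'featur', 'inform', 'retriev', 'system']
```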

Stemming
Stemming throws away useful distinctions.
In some cases it may be useful to conflate similar terms, resulting in increased recall; in others it may be harmful, resulting in reduced precision.

Zipf's law
The frequency of a word multiplied by its rank in a large corpus is approximately constant, i.e.
   frequency x rank ≈ constant

Figure: Relationship between the frequency of words and their rank order

Luhn's assumptions
Luhn (1958) attempted to quantify the discriminating power of terms by associating it with their frequency of occurrence (term frequency) within the document. He postulated that:
- high frequency words are quite common (function words)
- low frequency words are rare words
- medium frequency words are useful for indexing

Boolean model
the oldest of the three classical models
is based on Boolean logic and classical set theory
represents documents as a set of keywords, usually stored in an inverted file

Boolean model
Users are required to express their queries as a Boolean expression consisting of keywords connected with Boolean logical operators (AND, OR, NOT).
Retrieval is performed based on whether or not the document contains the query terms.

Boolean model
Given a finite set
   T = {t1, t2, ..., ti, ..., tm}
of index terms, a finite set
   D = {d1, d2, ..., dj, ..., dn}
of documents, and a Boolean expression in normal form representing a query Q as follows:
   Q = ∧ (∨ θi),   θi ∈ {ti, ¬ti}

Boolean model
1. The set Ri of documents that contain (or do not contain) term ti is obtained:
   Ri = { dj | θi ∈ dj },   θi ∈ {ti, ¬ti}
   where ¬ti ∈ dj means ti ∉ dj.
2. Set operations are used to retrieve documents in response to Q.

Example: Boolean model
Let the set of original documents be
D = {D1, D2, D3}
where
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.

Example: Boolean model
Let the set of terms used to represent these documents be:
T = {information, retrieval, query}
Then the set D of documents will be represented as follows:
D = {d1, d2, d3}
where
d1 = {information, retrieval, query}
d2 = {information, query}
d3 = {retrieval, query}

Example (Contd.)
Let the query Q be:
   Q = information ∧ retrieval
First, the following sets S1 and S2 of documents are retrieved in response to Q:
   S1 = { dj | information ∈ dj } = {d1, d2}
   S2 = { dj | retrieval ∈ dj } = {d1, d3}
Then, the following documents are retrieved in response to query Q:
   { dj | dj ∈ S1 ∩ S2 } = {d1}
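The same retrieval can be sketched with ordinary Python set operations; the postings dictionary and document IDs below mirror the example above, but the code itself is only an illustration.

```python
# Documents represented as sets of index terms, as on the previous slides
d = {
    "d1": {"information", "retrieval", "query"},
    "d2": {"information", "query"},
    "d3": {"retrieval", "query"},
}

# Build term -> set of documents containing it
postings = {}
for doc_id, terms in d.items():
    for term in terms:
        postings.setdefault(term, set()).add(doc_id)

# Q = information AND retrieval  -> intersect the two posting sets
s1 = postings["information"]          # {'d1', 'd2'}
s2 = postings["retrieval"]            # {'d1', 'd3'}
print(sorted(s1 & s2))                # ['d1']

# NOT t corresponds to the complement with respect to the whole collection
all_docs = set(d)
not_retrieval = all_docs - postings["retrieval"]
print(sorted(s1 & not_retrieval))     # information AND NOT retrieval -> ['d2']
```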

Boolean model
the model is not able to retrieve documents partly relevant to the user query
unable to produce a ranked list of documents
users seldom compose their queries with the pure Boolean expressions that this model requires

Extended Boolean model
To overcome weaknesses of the Boolean model, numerous extensions have been suggested that do provide a ranked list of documents.
Discussion of these extensions is beyond the scope of this tutorial.
Ref. P-norm model (Salton et al., 1983) and Paice model (Paice, 1984).

Vector Space Model
It represents documents and queries as vectors of features representing terms.
Features are assigned some numerical value that is usually some function of the frequency of terms.
The ranking algorithm computes similarity between document and query vectors to yield a retrieval score for each document.

Vector Space Model
Given a finite set of n documents:
   D = {d1, d2, ..., dj, ..., dn}
and a finite set of m terms:
   T = {t1, t2, ..., ti, ..., tm}
each document will be represented by a column vector of weights as follows:
   (w1j, w2j, w3j, ..., wij, ..., wmj)^t
where wij is the weight of term ti in document dj.

Vector Space Model
The document collection as a whole will be represented by an m x n term-document matrix as:

   | w11  w12 ... w1j ... w1n |
   | w21  w22 ... w2j ... w2n |
   | ...                      |
   | wi1  wi2 ... wij ... win |
   | ...                      |
   | wm1  wm2 ... wmj ... wmn |

Example: Vector Space Model
D1 = Information retrieval is concerned with the organization, storage, retrieval and evaluation of information relevant to user's query.
D2 = A user having an information need formulates a request in the form of query written in natural language.
D3 = The retrieval system responds by retrieving document that seems relevant to the query.

Example: Vector Space Model
Let the weights be assigned based on the frequency of the term within the document.
The term-document matrix (rows: information, retrieval, query; columns: D1, D2, D3) is:

   | 2  1  0 |
   | 2  0  1 |
   | 1  1  1 |

Vector Space Model
Raw term frequency approach gives too
much importance to the absolute values
of various coordinates of each document

Consider two document vectors:
   (2, 2, 1)^t
   (4, 4, 2)^t
The documents look similar except for the difference in magnitude of the term weights.

Normalizing term weights
To reduce the importance of the length of document vectors we normalize document vectors.
Normalization changes all the vectors to a standard length.
We can convert document vectors to unit length by dividing each dimension by the overall length of the vector.

Normalizing the term-document matrix:

   | 2  1  0 |
   | 2  0  1 |
   | 1  1  1 |

Elements of each column are divided by the length of the column vector, sqrt(Σi wij²). We get:

   | 0.67  0.71  0    |
   | 0.67  0     0.71 |
   | 0.33  0.71  0.71 |
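A short sketch of this column normalization using NumPy, assuming the term-document matrix from the example above.

```python
import numpy as np

# Term-document matrix from the example (rows: information, retrieval, query)
W = np.array([[2.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# Divide each column by its Euclidean length so every document vector has unit length
lengths = np.sqrt((W ** 2).sum(axis=0))
W_norm = W / lengths

print(np.round(W_norm, 2))
# [[0.67 0.71 0.  ]
#  [0.67 0.   0.71]
#  [0.33 0.71 0.71]]
```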

Term weighting
Luhn's postulate can be refined by noting that:
1. The more a document contains a given word, the more that document is about a concept represented by that word.
2. The fewer documents in a collection a term occurs in, the more discriminating that term is.

Term weighting
The first factor simply means that terms that occur more frequently in a document represent its meaning more strongly than those occurring less frequently.
The second factor considers term distribution across the document collection.

Term weighting
A measure that favors terms appearing in fewer documents is required.
The fraction n/ni gives exactly this measure, where
n is the total number of documents in the collection
and ni is the number of documents in which term i occurs.

Term weighting
As the number of documents in any collection is usually large, the log of this measure is usually taken, resulting in the following form of the inverse document frequency (idf) term weight:
   idf_i = log(n / ni)

Tf-idf weighting scheme
tf - document specific statistic
idf - global statistic that attempts to capture the distribution of the term across the document collection

Tf-idf weighting scheme
The term frequency (tf) component is a document specific statistic that measures the importance of the term within the document.
The inverse document frequency (idf) is a global statistic and attempts to include the distribution of the term across the document collection.

Tf-idf weighting scheme
Example: Computing tf-idf weight
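The slide's worked example is not reproduced in this text, so the following is an illustrative computation only, assuming raw term frequency and idf = log(n/ni); the counts come from the earlier three-document example.

```python
import math

# Illustrative collection statistics (from the earlier example, not the slide's own table):
n = 3                                                  # total number of documents
tf = {"information": 2, "retrieval": 2, "query": 1}    # term frequencies within D1
df = {"information": 2, "retrieval": 2, "query": 3}    # document frequencies n_i

weights = {t: tf[t] * math.log(n / df[t]) for t in tf}
for term, w in weights.items():
    print(f"{term:12s} tf={tf[term]} idf={math.log(n / df[term]):.3f} tf-idf={w:.3f}")
# 'query' gets weight 0 because it occurs in every document (idf = log(3/3) = 0)
```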

Normalizing tf and idf factors
tf can be normalized by dividing the term frequency by the frequency of the most frequent term in the document.
idf can be normalized by dividing it by the logarithm of the collection size (n).

Term weighting schemes
A third factor that may affect the weighting function is the document length.
The weighting schemes can thus be characterized by the following three factors:
1. Within-document frequency or term frequency
2. Collection frequency or inverse document frequency
3. Document length

A term weighting scheme can be represented by a triple ABC:
A - tf component
B - idf component
C - length normalization component.

Different combinations of options can be used to represent document and query vectors.
The retrieval models themselves can be represented by a pair of triples like nnn.nnn (doc = "nnn", query = "nnn").

Options for the three weighting factors
Term frequency within document (A)
   n   tf = tfij                                  (raw term frequency)
   b   tf = 0 or 1                                (binary weight)
   a   tf = 0.5 + 0.5 * tfij / max tf in Dj       (augmented term frequency)
   l   tf = ln(tfij) + 1.0                        (logarithmic term frequency)
Inverse document frequency (B)
   n   wt = tf            (no conversion)
   t   wt = tf * idf      (multiply tf with idf)

Options for the three weighting factors
Document length (C)
   n   wij = wt   (no conversion)
   c   wij is obtained by dividing each wt by sqrt(sum of (wts squared))

Indexing Algorithm
Step 1. Tokenization: extracts individual terms (words) in the document, converts all words to lower case and removes punctuation marks. The output of the first stage is a representation of the document as a stream of terms.
Step 2. Stop word elimination: removes words that appear very frequently across the document collection (stop words).
Step 3. Stemming: reduces the remaining terms to their linguistic root, to get index terms.
Step 4. Term weighting: assigns weights to terms according to their importance in the document, in the collection or some combination of both.

Example: Document Representation
Stemmed terms   Document 1   Document 2   Document 3
inform              0            0            1
intellig            0            0            1
model               1            1            0
probabilist         0            1            0
retriev             0            1            1
space               1            0            0
technique           0            0            1
vector              1            0            0

Similarity Measures
Inner product:
   sim(dj, qk) = Σ(i=1..m) wij * wik
Cosine similarity:
   sim(dj, qk) = Σ(i=1..m) wij * wik / ( sqrt(Σ(i=1..m) wij²) * sqrt(Σ(i=1..m) wik²) )
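A small sketch of both measures in Python. The document vectors reuse the raw-frequency example from the earlier slides; the query vector is an assumption made for illustration.

```python
import math

def inner_product(d, q):
    return sum(wd * wq for wd, wq in zip(d, q))

def cosine_similarity(d, q):
    num = inner_product(d, q)
    den = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return num / den if den else 0.0

# Document vectors (information, retrieval, query) and an assumed query {retrieval, query}
d1, d2, d3 = (2, 2, 1), (1, 0, 1), (0, 1, 1)
q = (0, 1, 1)
for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]:
    print(name, round(cosine_similarity(d, q), 3))
# d3 scores highest: its vector points in the same direction as the query
```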

Evaluation of IR Systems
The evaluation of an IR system is the process of assessing how well a system meets the information needs of its users (Voorhees, 2001).
Criteria for evaluation:
- Coverage of the collection
- Time lag
- Presentation format
- User effort
- Precision
- Recall

Evaluation of IR Systems
The IR evaluation models can be broadly classified as the system-driven model and the user-centered model.
The system-driven model focuses on measuring how well the system can rank documents.
The user-centered evaluation model attempts to measure the user's satisfaction with the system.

Why System Evaluation?
There are many retrieval models/algorithms/systems; which one is the best?
What is the best component for:
• Ranking function (dot-product, cosine, ...)
• Term selection (stop word removal, stemming, ...)
• Term weighting (TF, TF-IDF, ...)
How far down the ranked list will a user need to look to find some/all relevant documents?

Evaluation of IR Systems
The traditional goal of IR is to retrieve all and only the relevant documents in response to a query.
"All" is measured by recall: the proportion of relevant documents in the collection which are retrieved.
"Only" is measured by precision: the proportion of retrieved documents which are relevant.

Precision vs. Recall
(Venn diagram: within all docs, the sets Retrieved and Relevant overlap in RelRetrieved)
   Recall    = |RelRetrieved| / |Relevant|
   Precision = |RelRetrieved| / |Retrieved|

Trade-off between Recall and Precision
(Precision-recall plot: the ideal sits at precision 1 and recall 1; high precision with low recall returns relevant documents but misses many useful ones; high recall with low precision returns most relevant documents but includes lots of junk)

Test collection approach
The total number of relevant documents in a collection must be known in order for recall to be calculated.
To provide a framework for evaluation of IR systems, a number of test collections have been developed (Cranfield, TREC etc.).
These document collections are accompanied by a set of queries and relevance judgments.

IR test collections
Collection   Number of documents   Number of queries
Cranfield          1400                 225
CACM               3204                  64
CISI               1460                 112
LISA               6004                  35
TIME                423                  83
ADI                  82                  35
MEDLINE            1033                  30
TREC-1          742,611                 100

Fixed Recall Levels
One way to evaluate is to look at average precision at fixed recall levels.
• Provides the information needed for precision/recall graphs

Document Cutoff Levels
Another way to evaluate:
• Fix the number of documents retrieved at several levels: top 5, top 10, top 20, top 50, top 100
• Measure precision at each of these levels
• Take (weighted) average over results
This focuses on how well the system ranks the first k documents.

Computing Recall/Precision Points
For a given query, produce the ranked list of retrievals.
Mark each document in the ranked list that is relevant according to the gold standard.
Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: An Example
n    doc #   relevant
1     588       x
2     589       x
3     576
4     590       x
5     986
6     592       x
7     984
8     988
9     578
10    985
11    103
12    591
13    772       x
14    990
Let total # of relevant docs = 6. Check each new recall point:
R = 1/6 = 0.167;  P = 1/1 = 1
R = 2/6 = 0.333;  P = 2/2 = 1
R = 3/6 = 0.5;    P = 3/4 = 0.75
R = 4/6 = 0.667;  P = 4/6 = 0.667
R = 5/6 = 0.833;  P = 5/13 = 0.38
One relevant document is missing, so 100% recall is never reached.
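The recall/precision points above can be reproduced with a few lines of Python; the boolean relevance list below encodes the slide's ranked list.

```python
def recall_precision_points(ranked_relevance, total_relevant):
    """ranked_relevance: list of booleans, True where the ranked document is relevant."""
    points, hits = [], 0
    for rank, is_rel in enumerate(ranked_relevance, start=1):
        if is_rel:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points

# Ranked list from the slide: relevant documents at ranks 1, 2, 4, 6 and 13
ranked = [True, True, False, True, False, True, False, False,
          False, False, False, False, True, False]
for r, p in recall_precision_points(ranked, total_relevant=6):
    print(f"R={r:.3f}  P={p:.2f}")
# R=0.167 P=1.00, R=0.333 P=1.00, R=0.500 P=0.75, R=0.667 P=0.67, R=0.833 P=0.38
```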

Interpolating a Recall/Precision Curve
Interpolate a precision value for each standard recall level:
   rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}
   r0 = 0.0, r1 = 0.1, ..., r10 = 1.0
The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level greater than or equal to the j-th level.

Example: Interpolated Precision
Precision at observed recall points:
Recall   Precision
0.25     1.0
0.4      0.67
0.55     0.8
0.8      0.6
1.0      0.5
The interpolated precision:
Recall   Interpolated precision
0.0      1.0
0.1      1.0
0.2      1.0
0.3      0.8
0.4      0.8
0.5      0.8
0.6      0.6
0.7      0.6
0.8      0.6
0.9      0.5
1.0      0.5
Interpolated average precision = 0.745
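A sketch of the interpolation rule from the previous slide, applied to the observed points above.

```python
def interpolated_precision(observed):
    """observed: list of (recall, precision) pairs at the ranks where relevant docs occur.
    Returns the precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0."""
    result = {}
    for j in range(11):
        level = round(0.1 * j, 1)
        # maximum known precision at any recall level >= this standard level
        candidates = [p for r, p in observed if r >= level]
        result[level] = max(candidates) if candidates else 0.0
    return result

observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]
interp = interpolated_precision(observed)
print(interp)                                 # matches the interpolated table above
print(round(sum(interp.values()) / 11, 3))    # 0.745
```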

Recall-Precision graph (figure)

Average Recall/Precision Curve
Compute average precision at each standard recall level across all queries.
Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.

Average Recall/Precision Curve
Model
1. doc = "atn", query = "ntc"
2. doc = "atn", query = "atc"
3. doc = "atc", query = "atc"
4. doc = "atc", query = "ntc"
5. doc = "ntc", query = "ntc"
6. doc = "ltc", query = "ltc"
7. doc = "nnn", query = "nnn"

Problems with Precision/Recall
Can't know true recall value
• except in small collections
Precision/Recall are related
• A combined measure is sometimes more appropriate
Assumes batch mode
• Interactive IR is important and has different criteria for successful searches
Assumes a strict rank ordering matters.

Other measures: R-Precision
R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents for a topic.
• It de-emphasizes exact ranking of the retrieved relevant documents.
• The average is simply the mean R-Precision for individual topics in the run.

Other measures: F-measure
F-measure takes into account both precision and recall. It is defined as the harmonic mean of recall and precision:
   F = 2PR / (P + R)

Evaluation Problems
Realistic IR is interactive; traditional IR methods and measures are based on non-interactive situations.
Evaluating interactive IR requires human subjects (no gold standard or benchmarks).
[Ref.: See Borlund, 2000 & 2003; Borlund & Ingwersen, 1997 for IIR evaluation]

Web IR: II
Tutorial on Web IR (Contd.)

Web Page: Basics
Web pages are written in a tagged markup language called the Hypertext Markup Language (HTML)
Contain hyperlinks
- A hyperlink creates a link to another web page using a uniform resource locator (URL)
- A URL (http://www.icdim.org/program.asp) contains
  a protocol field (http)
  a server hostname (www.icdim.org)
  a file path (/, the root of the published file system)
Web pages are served through the Internet using the Hypertext Transfer Protocol (HTTP) to the client computer
- HTTP is built on top of TCP (Transmission Control Protocol)
Web pages can be viewed using browsers

IR on the Web
Input: The publicly accessible Web
Goal: Retrieve high quality pages that are relevant to the user's need
- Static (files: text, audio, ...)
- Dynamically generated on request: mostly database access
Two aspects:
1. Processing and representing the collection
   • Gathering the static pages
   • "Learning" about the dynamic pages
2. Processing queries (searching)

How Web IR differs from classic IR?
The Web is:
huge,
dynamic,
self-organized, and
hyperlinked

How Web IR differs from classic IR?
1. Pages:
Bulk ........................ >1B (12/99)
Lack of stability ........... Estimates: 23%/day, 38%/week
Heterogeneity
- Type of documents ........ Text, pictures, audio, scripts, ...
- Quality
- Language ................. 100+
Duplication
- Syntactic ................ 30% (near) duplicates
- Semantic ................. ??
High linkage ................ ≥ 8 links/page on average

The big challenge
Meet the user needs given the heterogeneity of Web pages.

How Web IR differs from classic IR?
2. Users
Make poor queries
- Short (2.35 terms avg)
- Imprecise terms
- Sub-optimal syntax (80% queries without operator)
- Low effort

How Web IR differs from classic IR?
Wide variance in
- Needs
- Knowledge
- Bandwidth
Specific behavior
- 85% look over one result screen only
- 78% of queries are not modified
- Follow links

The big challenge
Meet the user needs given the heterogeneity of Web pages and the poorly made queries.

Web IR: The bright side
Many tools available
Personalization
Interactivity (refine the query if needed)

Web IR tools
General-purpose search engines:
- Direct: AltaVista, Excite, Google, Infoseek, Lycos, ...
- Indirect (Meta-search): MetaCrawler, AskJeeves, ...
Hierarchical directories:
Yahoo!, The Best of the Web (http://botw.org)

Best of the Web (screenshot)

Web IR tools
Specialized search engines:
- Home page finder: Ahoy
- Shopping robots: Jango, Junglee, ...
- School and University: searchedu.com
- Live images from everywhere: EarthCam (http://www.earthcam.com)
- The Search Engines Directory: http://www.searchengineguide.com/searchengines.html
Search-by-example: Alexa's "What's related", Excite's "More like this"

Search Engines' Components
1. Crawler (spider, robot, or bot) - fetches web pages
2. Indexer - processes and represents the data
3. Search interface - answers queries

Crawler: Basic Principles
A crawler collects web pages by scanning collected web pages for hyperlinks to other pages that have not been collected yet.
1. Start from a given set of URLs
2. Fetch and scan them for new URLs (outlinks)
3. Fetch these pages in turn
4. Repeat steps 2-3 for the newly found pages
[Ref.: Heydon & Najork, 1999; Brin & Page, 1998]

Indexing the web
In traditional IR, documents are static and non-linked.
Web documents are dynamic: documents are added, modified or deleted.
The mappings from terms to documents and positions are not constructed incrementally (see the inverted index example earlier).
How to update the index?
Web documents are linked: they contain hyperlinks.
How to rank documents?

Web pages are dynamic: Some figures
40% of all webpages in their dataset changed within a week, and 23% of the .com pages changed daily
[Junghoo Cho and Hector Garcia-Molina, 2000]

Updating index: A Solution
- A static index is made, which is the main index used for answering queries.
- A signed (d,t) record, (d,t,s), is maintained for documents being added or deleted (or modified), where s is a bit that specifies whether the document has been deleted or inserted.
- A stop-press index is created using the (d,t,s) records.
- A query is sent to both the main index and the stop-press index.
- The main index returns a set of documents (D0).
- The stop-press index returns two sets (D+ and D-): D+ is the set of documents not yet indexed, and D- is the set of documents matching the query that have been removed from the collection since D0 was constructed.

Updating index
- The retrieved set is constructed as D0 ∪ D+ \ D-.
- When the stop-press index gets too large, the signed (d,t,s) records are sorted in (d,t,s) order and merge-purged into the master (t,d) records.
- The master index is rebuilt.
- The stop-press index is emptied.
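The way the two indexes' answers are combined can be written directly as set algebra; the document IDs below are made up for illustration.

```python
# Sketch of combining main-index and stop-press results as D0 ∪ D+ \ D-
D0 = {"d1", "d2", "d5"}        # returned by the main (static) index
D_plus = {"d7"}                # matching docs added since the main index was built
D_minus = {"d2"}               # matching docs deleted since the main index was built

result = (D0 | D_plus) - D_minus
print(sorted(result))          # ['d1', 'd5', 'd7']
```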

Index Compression Techniques
A significant portion of index space is used by document IDs.
Delta encoding is used to save space:
- Sort document IDs in increasing order
- Store the first ID in full, and only gaps (i.e. the difference from the previous ID) for subsequent entries

Delta Encoding: Example
Suppose the word harmony appears in documents 50, 75, 110, 120, 125, 170, 200. The record for harmony is the vector (50, 25, 35, 10, 5, 45, 30).
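A small sketch of delta encoding and decoding of a sorted posting list, using the harmony example above.

```python
def delta_encode(doc_ids):
    """doc_ids must be sorted in increasing order."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def delta_decode(gaps):
    doc_ids, running = [], 0
    for gap in gaps:
        running += gap
        doc_ids.append(running)
    return doc_ids

ids = [50, 75, 110, 120, 125, 170, 200]
print(delta_encode(ids))                 # [50, 25, 35, 10, 5, 45, 30]
print(delta_decode(delta_encode(ids)))   # recovers the original IDs
```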

Other Issues
Spamming
- Adding popular terms to pages unrelated to those terms
Titles, headings, metatags
- Search engines give additional weight to terms occurring in titles, headings, font modifiers and metatags.
Approximate string matching
- Soundex
- n-gram (Google suggests variant spellings of query terms based on the query log using n-grams)
Metasearch engines
- Meta-search engines send the query to several search engines at once and return the results from all of the search engines in one long unified list.

Search Engine Watch
http://searchenginewatch.com/showPage.html?page=2156451

Shares of Searches: July 2006 (chart)

Link Analysis
Link analysis is a technique that exploits the additional information inherent in the hyperlink structure of the Web to improve the quality of search results.
Link analysis or ranking algorithms underlie several of today's most popular and successful search engines.
All major search engines combine link analysis scores with more traditional information retrieval scores to retrieve web pages in response to a query.
PageRank and HITS are two algorithms for ranking web pages based on links.

Google's Search behaviour: Few points
Google considers over a hundred factors in determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, term order and the proximity of the search terms to one another on the page.
Words in a special typeface, bold, underlined or all capitals, get extra credit.
Google can also match multi-word phrases and sentences.
Google does not disclose everything that goes into its search index, but the cornerstone algorithm, called PageRank, is well known.
For more information on Google's technology, visit www.google.com/technology

Google's Search behaviour
Google returns only pages that match all your search terms.
A search for [cricket match] finds pages containing the words "cricket" and "match".

Google's Search behaviour
Google returns pages that match your search terms exactly.
If you search for "tv" Google won't find "television"; if you search for "cheap" Google won't find "inexpensive".

Google's Search behaviour
Google returns pages that match variants of your search terms.
The query [child lock] finds pages that contain words that are similar to some or all of your search terms, e.g., "child", "children", or "children's", "locks" or "locking".

Google's Search behaviour
Google ignores stop words.
Google favors results that have your search terms near each other.
(Google considers the proximity of your search terms within a page.)

Google's Search behaviour
Google gives higher priority to pages that have the terms in the same order as in your query.
Google is NOT case sensitive; it assumes all search terms are lowercase.
Google ignores some punctuation and special characters, including , . ; ? [ ] ( ) @ / #.
[Dr. Tanveer] returns the same results as [Dr Tanveer].

Google's Search behaviour
A term with an apostrophe (single quote) doesn't match the term without an apostrophe.
Google searches for variations on any hyphenated terms.
[e-mail] matches "e-mail", "email", and "e mail".

Google’s Search behaviour: Summary

Google's PageRank
PageRank is a system for ranking web pages, developed by the Google founders Larry Page and Sergey Brin at Stanford University.
PageRank determines the quality of a Web page by the pages that link to it.
Once a month, Google's spiders crawl the Web to update and enhance the Google index.
PageRank is a numeric value that represents how important a page is on the web.

Google's PageRank
PageRank looks at a Web page and determines how many other pages link to it (a measure of popularity).
PageRank then analyzes the links to those pages.
When one page links to another page, it is effectively casting a vote for the other page.
The more votes that are cast for a page, the more important the page must be.
Google calculates a page's importance from the votes cast for it.

Google's PageRank
The PageRank of page A, PR(A), is:
   PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
where
   T1...Tn are pages linking to page A
   d is a dampening factor, usually set to 0.85
   PR(T) is the PageRank of page T
   C(T) is the number of links going out of page T
   PR(T)/C(T) is the PageRank of page T divided by the number of links going out of that page.
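A compact Python sketch of the iterative computation using the non-normalized formula above; the link graph at the bottom is the one used in a later worked example on these slides (A links to B and C, and both link back to A).

```python
def pagerank(links, d=0.85, iterations=10, initial=1.0):
    """links: mapping page -> list of pages it links to.
    Computes PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A,
    updating all pages from the previous iteration's values."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    pr = {p: initial for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = [t for t in pages if links.get(t) and page in links[t]]
            share = sum(pr[t] / len(links[t]) for t in incoming)
            new_pr[page] = (1 - d) + d * share
        pr = new_pr
    return pr

links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
print(pagerank(links, iterations=1))   # A: 1.85, B and C: 0.575, as in the later example
```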

Google's PageRank
More simply,
page's PageRank = 0.15 + 0.85 * (a "share" of the PageRank of every page that links to it)
"share" = the linking page's PageRank divided by the number of outgoing links on the page
A page "votes" an amount of PageRank onto each page that it links to.

Google's PageRank
The PageRank of a page that links to yours is important, but the number of links on that page is also important.
The more links there are on a page, the less PageRank value your page will receive from it.

Google's PageRank
The PageRank algorithm is applied by first guessing a PageRank for all the pages that have been indexed and then recursively iterating until the PageRank converges.
What PageRank in effect says is that pages "vote" for other pages on the Internet. So if page A links to page B (i.e. page A votes for page B), it is saying that B is an important page.
If lots of pages link to a page, then it has more votes and its worth should be higher.

Google's PageRank: Example
Let's consider a web site with 3 pages (pages A, B and C) with no links coming in from the outside and an initial PageRank of 1.
After one iteration:
   PR(A) = (1-d) + d(...) = 0.15
   PR(B) = PR(C) = 0.15
After 10 iterations:
   PR(A) = PR(B) = PR(C) = 0.15
(Diagram: pages A, B, C with no links)

Example
Now, we link page A to page B and run the calculations for each page. We end up with:
   Page A = 0.15
   Page B = 1
   Page C = 0.15
After the second iteration the figures are:
   Page A = 0.15
   Page B = 0.2775
   Page C = 0.15
(Diagram: A links to B; C is isolated)

Example
Now, we link all pages to all pages and repeat the calculations with an initial PageRank of 1. We get:
   Page A = 1
   Page B = 1
   Page C = 1
(Diagram: A, B, C fully interlinked)

Example
Now remove the links between page B and page C.
After 1 iteration the results are:
   PR(A) = 0.15 + 0.85(1+1) = 1.85
   PR(B) = 0.15 + 0.85(1/2) = 0.575
   PR(C) = 0.575
After the second iteration, the results are:
   PR(A) = 1.1275
   PR(B) = 0.93625
   PR(C) = 0.93625
After the third iteration:
   PR(A) = 1.741625
   PR(B) = 0.629
   PR(C) = 0.629
...
(Diagram: A links to B and C; B and C link back to A)

Exercise: Calculate PageRank values for the following graph of pages A, B and C (diagram).

Semantic Web?
"Semantic" stands for "the meaning of". The semantics of something is the meaning of something.
The Semantic Web is about describing things in a way that computer applications can understand.

Semantic Web
The Semantic Web is not about links between web pages.
It describes the relationships between things (like A is a part of B and Y is a member of Z) and the properties of things (like size, weight, age, etc.).

Semantic Web
Semantic Web = a Web with a meaning.
"If HTML and the Web made all the online documents look like one huge book, RDF, schema, and inference languages will make all the data in the world look like one huge database"
Tim Berners-Lee, Weaving the Web, 1999
The Semantic Web uses RDF to describe web resources.

RDF
The Resource Description Framework
RDF (Resource Description Framework) is a markup language for describing information and resources on the web.
Putting information into RDF files makes it possible for computer programs ("web spiders") to search, discover, pick up, collect, analyze and process information from the web.

References
Soumen Chakrabarti, "Mining the Web: Discovering knowledge from hypertext data". Elsevier, 2003.
B. J. Jansen, A. Spink and T. Saracevic, "Real life, real users and real needs: A study and analysis of user queries on the Web". Information Processing & Management, 36(2), 207-227, 2000.
G. Salton, E. A. Fox and H. Wu, "Extended Boolean information retrieval". Communications of the ACM, 26(11), 1022-1036, 1983.
G. Salton and M. J. McGill, "Introduction to modern information retrieval". New York: McGraw-Hill, 1983.
C. J. van Rijsbergen, "Information Retrieval". 2nd ed. Butterworths, London.
E. A. Fox, S. Betrabet, M. Kaushik and W. Lee, "Extended Boolean model". In W. Frakes and R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms. Prentice Hall, pp. 393-418, 1992.
H. P. Luhn, "The automatic creation of literature abstracts". IBM Journal of Research and Development, 2(2), 1958.
C. P. Paice, "Soft evaluation of Boolean search queries in information retrieval systems". Information Technology: Research and Development, 3(1), 33-42, 1984.
M. F. Porter, "An algorithm for suffix stripping". Program, 14(3), 130-137, 1980.
A. Heydon and M. Najork, "A scalable, extensible web crawler". World Wide Web, 2(4), pages 219-229, 1999.
S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine". In Proceedings of the 7th World Wide Web Conference, 1998. decweb.ethz.ch/WWW7/1921/com1921.htm

References
P. Borlund and P. Ingwersen, "The development of a method for the evaluation of interactive information retrieval systems". Journal of Documentation, 53:225-250, 1997.
P. Borlund, "Experimental components for the evaluation of interactive information retrieval systems". Journal of Documentation, Vol. 56, No. 1, 2000.
P. Borlund, "The IIR evaluation model: a framework for evaluation of interactive information retrieval systems". Information Research, 8(3), 2003.
Anastasios Tombros and C. J. van Rijsbergen, "Query-sensitive similarity measures for information retrieval". Knowledge and Information Systems, 6, 617-642, 2004.
Tanveer J Siddiqui and Uma Shanker Tiwary, "Integrating relation and keyword matching in information retrieval". In Rajiv Khosla, Robert J. Howlett, Lakhmi C. Jain (Eds.), Proceedings of the 9th International Conference on Knowledge-Based Intelligent Information and Engineering Systems: Data mining and soft computing applications-II, Lecture Notes in Computer Science, vol. 3684, pages 64-71, Melbourne, Australia, September 2005. Springer Verlag.
Siddiqui, Tanveer J. and Tiwary, U. S., "A hybrid model to improve relevance in document retrieval". Journal of Digital Information Management, 4(1), 2006, pp. 73-81.

Resources
Text Books: Salton, Rijsbergen
Sander Dominich
Frakes & Baeza-Yates
Suggested Readings:
Robertson, Sparck-Jones, Voorhees
Myaeng, Liddy, Khoo, I. Ounis, Gelbukh, A. F. Smeaton, T. Strzalkowski
B. J. Jansen, A. Spink
Padmini Srinivasan, M. Mitra, A. Singhal
T. Saracevic, D. Harman, P. Borlund (evaluation/relevance)

Resources
Information Retrieval
Information Processing & Management
JASIS (J. of American Soc. for Info. Science)
TOIS (ACM Trans. on Information Systems)
Information Research (Online)
Proceedings of SIGIR/TREC conferences
Jou. of Digital Information Management
KAIS (Springer)

Thank You