Statistics
•In 1994, one of the first web search
engines, the World Wide Web Worm
(WWWW), had an index of 110,000 web
pages and web accessible documents.
•Up to 09/2005 Google indexes
8,200,000,000 web pages.
Search Engines
•A search engine is a system which
collects, organizes & presents a way to
select Web documents based on certain
words, phrases, or patterns within
documents
–Model the Web as a full-text DB
–Index a portion of the Web docs
–Search Web documents using user-specified
words/patterns in a text
Search Engines
•Two categories of search engine
–general-purpose search engine, e.g. Yahoo!,
AltaVista and Google
–special-purpose search engines (or Internet
Portals), e.g. LinuxStart (www.linuxstart.com)
Search Engines
•Two main components of a search engine:
–web crawler (spider), which collects massive Web
pages.
–large database, which stores and indexes collected
Web pages.
•Ranking has to be performed without accessing
the text, just the index
•Ranking algorithms: all information is “top
secret;” it is almost impossible to measure recall
as the number of relevant pages can be quite
large for simple queries
Search Engine
•Information Retrieval (IR) is a key to
search engine or Web Search.
•Most commonly – used models:
–Boolean Model
–Vector Space Model (VSM)
–Probability Model
–their variations
Search Engine
•The Google is a representative general-purpose search
engine (successful, takes many techniques such as use of
both web page link structure and text content, which fully
takes advantage of the web page features).
•The PageRank in Google is defined as follow:
Assume page A has pages P
1
...P
n
which point to it. The
parameter d is a damping factor which can be set between 0
and 1. Also C(P
i
) is defined as the number of links going out
of page P
i
. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(P
1
)/C(P
1
) + ... + PR(P
n
)/C(P
n
))
Usually the parameter d is set to 0.85. PageRank or PR(A)
can be calculated using a simple iterative algorithm.
•Other features: anchor text processing, location information
management and various data structures, which fully make
use of the features of the web.
Ranking result pages
•Based on content
–Based on content
–Number of occurrences of the search terms
•Based on link structure
–Backlink count
–PageRank
•And more.
(http://www.cs.duke.edu/~junyang/courses/cps296.1-2002-spring/)
Problems with content-based
ranking
•Many pages containing search terms may be of
poor quality or irrelevant
–Example: a page with just a line “search engine”.
•Many high-quality or relevant pages do not even
contain the search terms
–Example: Google homepage
•Page containing more occurrences of the search
terms are ranked higher; spamming is easy
–Example: a page with line “search engine” Repeated
many times
Based on link structure
•Hyperlinks among web pages provide new
web search opportunities.
•Our focus in this lecture
–PageRank
–HITS
Backlink
•A backlink of a page p is a link that points to p
•A page with more backlinks is ranked higher
•Intuition: Each backlink is a “vote” for the page’s
importance
•Based on local link structure; still easy to spam
–Create lots of pages that point to a particular page
PageRank
•Page et al., "The PageRank Citation Ranking: Brining Order to the
Web.", 1998
•Main idea: Pages pointed by high-ranking
pages are ranked higher
•Definition is recursive by design
•Based on global link structure; hard to
spam
PageRank
•Web can be viewed as a huge directed graph G(V, E)
–where V is the set of web pages (vertices) and E is the set of
hyperlinks (directed edges).
•Each page may have a number of outgoing edges
(forward links) and a number of incoming links (backlinks).
•Each backlink of a page represents a citation to the page.
•PageRank is a measure of global web page importance
based on the backlinks of web pages.
Basic ideas
•A page is likely to be important if it is linked by
many pages.
•A page is likely to be important If it is linked to by
important pages
–even though it may not necessarily be linked by a large
number of pages.
•The importance of a page is divided evenly and
propagated to the pages pointed to by it.
8
4
4
PageRank Definition
•N(p): number of outgoing links from page p
•B(p): set of pages that point to p
•PageRank(p) = Σq∈B(p) (PageRank(q) / N(q))
•Intuition
–Each page q evenly distributes its importance to all
pages that q points to
–Each page p gets a boost of its importance from each
page that points to p
Computing PageRank
•Create a stochastic matrix M for the link structure
–Each page i corresponds to row i and column i
–If page j has n outgoing links, then the M(i, j) = 1 Ú n if page j
•Solve by setting all PageRank.s to 1 initially, and
thenapplying M repeatedly until the values converge
ú
ú
ú
ú
û
ù
ê
ê
ê
ê
ë
é
=
ú
ú
ú
ú
û
ù
ê
ê
ê
ê
ë
é
)PageRank(p
...
)PageRank(p
)PageRank(p
)PageRank(p
...
)PageRank(p
)PageRank(p
m
2
1
m
2
1
M
Convergence
•If the ranks converge, i.e., there is a rank
vector R such that R
= M ´ R, R is the
eigenvector of matrix M with eigenvalue
being 1.
•Convergence is guaranteed only if
–M is aperiodic (the Web graph is not a big
cycle). This is practically guaranteed for Web.
–M is irreducible (the Web graph is strongly
connected). This is usually not true.
Example
p1
p2 p3
r1
r3
r2
0.5 00.5
0 0 0.5
0.5 10
=
r1
r3
r2
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
=
ú
ú
ú
û
ù
ê
ê
ê
ë
é
2.1
6.0
2.1
,...,
0625.1
6875.0
25.1
,
1
75.0
25.1
,
5.1
5.0
1
,
1
1
1
2
3
1
r
r
r
The sum of all PageRank’s is the total
number of pages
Random surfer model
•A random surfer
–Starts with a random page
–Randomly selects a link on the page to visit
next
–Never uses the .back. button
•PageRank(p) measures the probability
that a
–random surfer visits page p
Rank leak
p1
p2 p3
r1
r3
r2
0.5 00.5
0 0 0.5
0.5 00
=
r1
r3
r2
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
=
ú
ú
ú
û
ù
ê
ê
ê
ë
é
0
0
0
,...,
3125.0
1875.0
5.0
,
375.0
25.0
625.0
,
5.0
25.0
75.0
,
5.0
5.0
1
,
1
1
1
2
3
1
r
r
r
• Pages with no outgoing links
• Called a rank leak because eventually all importance
will leak. out of the Web
Rank sink
•A page or a group of pages is a rank sink if
they can receive rank propagation from its
parents but cannot propagate rank to other
pages.
•Causes the loss of total ranks.
•Eventually accumulate all importance of the
Web
An example of rank sink
p1
p2 p3
r1
r3
r2
0.5 00.5
0 1 0.5
0.5 00
=
r1
r3
r2
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
ú
ú
ú
û
ù
ê
ê
ê
ë
é
=
ú
ú
ú
û
ù
ê
ê
ê
ë
é
0
3
0
,...,
3125.0
1875.2
5.0
,
375.0
2
625.0
,
5.0
75.1
75.0
,
5.0
5.1
1
,
1
1
1
2
3
1
r
r
r
Solution
•d: damping factor
•PageRank(p) =d · Σq∈B(p) (PageRank(q)
/N(q)) + (1 -d )
•Intuition in the random surfer model
–A surfer occasionally gets bored and jump to
a random page on the Web instead of
following a random link
•Another intuition
–Make the graph fully connected
An example
p1
p2 p3
ú
ú
ú
û
ù
ê
ê
ê
ë
é
=
ú
ú
ú
û
ù
ê
ê
ê
ë
é
11/5
11/21
11/7
,...,
2
3
1
r
r
r
r1
r3
r2
0.5 00.5
0 1 0.5
0.5 00
=
r1
r3
r2
+
0.2
0.2
0.2
the sum of all PageRank’s is still the total
number of pages.