Graph Mining_Module-3_CS7 (PageRank).pptx

GururajaHebburSatyan · 22 slides · Jun 23, 2024


Slide Content

Seed in Pseudo-Random Number Generation

PageRank

The importance of a Web page is an inherently subjective matter that depends on the reader's interests, knowledge and attitudes. The World Wide Web creates many new challenges for information retrieval: it is very large and heterogeneous. Current estimates (in the 1999 paper) are that there are over 150 million web pages, with a doubling life of less than one year. The Web is hypertext and provides considerable auxiliary information on top of the text of the web pages, such as link structure and link text. [Lawrence Page et al., "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report, Stanford InfoLab, 1999.]

There has been a great deal of work on academic citation analysis. Kleinberg [Kle98] developed an interesting model of the web as Hubs and Authorities, based on an eigenvector calculation on the co-citation matrix of the web. The fact that the Yahoo home page has so many backlinks generally implies that it is quite important. Indeed, many web search engines have used backlink count as a way to bias their databases in favour of higher-quality or more important pages. [Jon Kleinberg, "Authoritative sources in a hyperlinked environment", ACM-SIAM Symposium on Discrete Algorithms, 1998.]

A Ranking for Every Page on the Web: we can never know whether we have found all the backlinks of a particular page, but if we have downloaded it, we know all of its forward links at that time. Web pages vary greatly in the number of backlinks they have, and highly linked pages are generally more "important" than pages with few links. If a web page has a link from the Yahoo home page, it may be just one link, but it is a very important one; such a page should be ranked higher than many pages with more links from obscure places. PageRank is an attempt to see how good an approximation to "importance" can be obtained just from the link structure. [Kleinberg, 1998]

PageRank Algorithm: our main task is to rank websites in the search results. There are two widely used algorithms to rank them based on popularity: HITS (Hypertext Induced Topic Search) and the PageRank algorithm. The WWW hyperlink structure forms a huge directed graph in which the nodes represent the web pages and the directed edges represent the hyperlinks. [Page et al., 1999]

Example web graph: a directed graph on ten nodes A through J (figure).

The main concept of the PageRank algorithm is that it counts the web links pointing towards a page, along with their individual importance, to calculate the approximate importance of that page. The underlying assumption is that "more important websites are likely to receive more links from other websites." The importance a page gains from its inbound links is proportional to the importance of the linking pages, so its rank depends on who links to it and not purely on its indegree. [Page et al., 1999]

PageRank illustration (figure; source: Wikipedia).

Terminologies: Inbound links are the links pointing to the given site from other sites (from outside). Outbound links are the links from the given page to pages in the same site or other sites (to outside). [Page et al., 1999]

Terminologies: Dangling links are links that point to pages which themselves point nowhere else. Often these are simply pages that have not been downloaded yet, since it is hard to sample the entire web. Because dangling links do not directly affect the ranking of any other page, we simply remove them from the system until all the PageRanks are calculated; after that they can be added back in without affecting things significantly. [Page et al., 1999]
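A minimal sketch of this pre-processing step on a small hypothetical link graph (the page names are made up for illustration): pages with no outbound links are dropped, repeatedly, since each removal can create new dangling pages.

```python
def remove_dangling(graph):
    """Repeatedly drop pages with no outbound links (dangling nodes).

    `graph` maps each page to the list of pages it links to.
    Removals can create new dangling pages, so loop until stable.
    """
    g = {page: list(links) for page, links in graph.items()}
    while True:
        dangling = {page for page, links in g.items() if not links}
        if not dangling:
            return g
        g = {page: [t for t in links if t not in dangling]
             for page, links in g.items() if page not in dangling}

# Hypothetical example: page "c" has no outbound links, so it is dangling.
links = {"a": ["b", "c"], "b": ["a", "c"], "c": []}
print(remove_dangling(links))   # {'a': ['b'], 'b': ['a']}
```

After the ranks converge on the reduced graph, the removed pages can be added back and given ranks from their in-links, as the slide notes.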

Web crawling to discover the structure of the WWW: BFS is mostly used for web crawling; DFS is not useful in our case of PageRank calculations (why? BFS reaches the heavily linked pages near the seeds early, while DFS can wander down long chains of obscure pages). Google requires about 28 days to complete a full web crawl, but runs a separate crawler for news (once an hour, every few hours, or once a day). [Page et al., 1999]
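A minimal sketch of such a breadth-first crawl; `fetch_links` is a hypothetical function (not part of the slides) that downloads a page and returns its forward links.

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: visit pages level by level outward from the seeds.

    Returns the discovered link graph as {url: [outlinks]}.
    """
    graph = {}
    visited = set(seed_urls)
    frontier = deque(seed_urls)
    while frontier and len(graph) < max_pages:
        url = frontier.popleft()
        outlinks = fetch_links(url)          # hypothetical page fetcher
        graph[url] = outlinks
        for link in outlinks:
            if link not in visited:          # enqueue unseen pages for later levels
                visited.add(link)
                frontier.append(link)
    return graph
```

The FIFO queue is what makes this breadth-first; swapping it for a stack would turn it into the depth-first exploration the slide warns against.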

The original formula: PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) ). PR(A) is the PageRank of page A. The formula is recursive, because a page's PageRank depends upon the PageRank of the pages that link to it. PR(T1) is the PageRank of a page T1 which links to page A, C(T1) is the number of outbound links on page T1, and d is the damping factor, the probability of following a link (typically about 0.85). [Page et al., 1999]

We initialize the PageRank of all the webpages to the same value (typically 1/N for N pages) and iterate until the PageRank values converge (the reason for convergence is explained at the end). So let us first solve an example without the damping factor. [Page et al., 1999]
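A minimal sketch of this iteration in Python on the ten-node example graph, with the edges read off the transition matrix shown on the matrix-representation slide; d = 1.0 reproduces the "without damping factor" exercise.

```python
graph = {                      # page -> pages it links to (edges read from the H matrix)
    "A": ["D"],
    "B": ["A", "C"],
    "C": ["A", "B"],
    "D": ["A", "E"],
    "E": ["J"],
    "F": ["C", "G", "H"],
    "G": ["F"],
    "H": ["G", "I"],
    "I": ["B", "G", "J"],
    "J": ["D", "I"],
}

def pagerank(graph, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T)/C(T)) over the pages T linking to A."""
    n = len(graph)
    pr = {page: 1.0 / n for page in graph}                # initialise every page at 1/N
    out_count = {page: len(links) for page, links in graph.items()}
    for _ in range(iterations):
        new_pr = {}
        for page in graph:
            incoming = sum(pr[t] / out_count[t]
                           for t, links in graph.items() if page in links)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

print(pagerank(graph, d=1.0))   # d = 1.0 gives the "without damping factor" iteration
```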

Worked example: the iteration applied to the ten-node graph A through J (step-by-step calculations in PageRank Solution.xlsx). [Page et al., 1999]

Matrix Representation (Power Method): we can use a matrix operation in place of the page-by-page iterative approach. Rather than updating one column (page) at a time, a single matrix-vector multiplication performs all the updates at the same time (step-by-step values in PageRank Solution.xlsx). H is the transition matrix of the hyperlinks for the example graph; entry (i, j) is 1/C(j) if page j links to page i, and 0 otherwise:

Node   A     B     C     D     E     F     G     H     I     J
A      0.00  0.50  0.50  0.50  0.00  0.00  0.00  0.00  0.00  0.00
B      0.00  0.00  0.50  0.00  0.00  0.00  0.00  0.00  0.33  0.00
C      0.00  0.50  0.00  0.00  0.00  0.33  0.00  0.00  0.00  0.00
D      1.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50
E      0.00  0.00  0.00  0.50  0.00  0.00  0.00  0.00  0.00  0.00
F      0.00  0.00  0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00
G      0.00  0.00  0.00  0.00  0.00  0.33  0.00  0.50  0.33  0.00
H      0.00  0.00  0.00  0.00  0.00  0.33  0.00  0.00  0.00  0.00
I      0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00  0.50
J      0.00  0.00  0.00  0.00  1.00  0.00  0.00  0.00  0.33  0.00
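A sketch of the power method on this matrix (using 1/3 where the slide shows the rounded value 0.33):

```python
import numpy as np

# Column-stochastic transition matrix H from the slide (0.33 entries are 1/3).
H = np.array([
    # A     B     C     D     E     F     G     H     I     J
    [0.00, 0.50, 0.50, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],  # A
    [0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 1/3,  0.00],  # B
    [0.00, 0.50, 0.00, 0.00, 0.00, 1/3,  0.00, 0.00, 0.00, 0.00],  # C
    [1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50],  # D
    [0.00, 0.00, 0.00, 0.50, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],  # E
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00],  # F
    [0.00, 0.00, 0.00, 0.00, 0.00, 1/3,  0.00, 0.50, 1/3,  0.00],  # G
    [0.00, 0.00, 0.00, 0.00, 0.00, 1/3,  0.00, 0.00, 0.00, 0.00],  # H
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.50, 0.00, 0.50],  # I
    [0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1/3,  0.00],  # J
])

x = np.full(10, 1 / 10)       # uniform starting PageRank vector
for _ in range(100):          # power iteration: x <- H x until the values settle
    x = H @ x

print(dict(zip("ABCDEFGHIJ", x.round(4))))
```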

This matrix form is also iterative: starting from an initial vector, we repeatedly compute x_(k+1) = H x_k until the values converge.

Steady State Approach: from a directed graph we can construct the transition matrix. This is a steady-state approach: each column of the matrix sums to 1 and the dominant eigenvalue is 1, so the PageRank vector satisfies Hx = x, where H is the transition matrix of the hyperlinks and x is the final PageRank of the given pages.
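A small illustration of solving Hx = x directly with an eigen-decomposition, on a hypothetical 3-page web (page 1 links to 2 and 3, page 2 to 3, page 3 to 1) rather than the full example graph:

```python
import numpy as np

# Hypothetical 3-page web: page 1 links to 2 and 3, page 2 to 3, page 3 to 1.
H = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])

# Solve Hx = x: take the eigenvector whose eigenvalue is (closest to) 1.
eigenvalues, eigenvectors = np.linalg.eig(H)
k = np.argmin(np.abs(eigenvalues - 1.0))
x = np.real(eigenvectors[:, k])
x = x / x.sum()                 # normalise so the ranks sum to 1

print(x)                        # steady-state PageRank, here [0.4, 0.2, 0.4]
```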

Random Surfer Approach: PageRank can be defined as the probability that a random surfer who starts on a random page and keeps following hyperlinks visits the given page. The values in each column sum to 1, so the process is a Markov chain whose transition matrix defines the next step. The stationary distribution of this chain is the final PageRank vector: Hx = x.

Problem with the current algorithm: the algorithm discussed so far does not work for dangling nodes, and it does not work for disconnected, independent components. The "damping factor" is used to deal with these problems. In the conventional form, M = (1 - d)A + (d/n)B, where M is the PageRank (Google) matrix, A is the transition matrix, B is the n x n matrix of all ones, n is the number of webpages in the World Wide Web, and d is the teleportation probability (about 0.15, the complement of the 0.85 damping factor in the earlier formula). (1 - d) is the probability that the surfer follows the links in the given webpage, i.e. stays in the neighbourhood of the current page; with the small probability d the surfer leaves the current page and jumps to a totally new webpage, which is termed "teleportation". (Step-by-step values in PageRank Solution.xlsx.)
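A sketch of this damped matrix under the reading above (B taken as the all-ones matrix, d as the teleportation probability), applied to the same hypothetical 3-page transition matrix used earlier:

```python
import numpy as np

def google_matrix(A, d=0.15):
    """Damped PageRank matrix: follow a link w.p. (1 - d), teleport w.p. d.

    A is an n x n column-stochastic transition matrix; (d / n) * B, with B
    the all-ones matrix, spreads the teleport probability over every page.
    """
    n = A.shape[0]
    B = np.ones((n, n))
    return (1 - d) * A + (d / n) * B

# Hypothetical 3-page transition matrix (each column sums to 1).
A = np.array([
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
])
M = google_matrix(A)

x = np.full(3, 1 / 3)
for _ in range(100):            # power iteration on the damped matrix
    x = M @ x
print(x)                        # damped PageRank vector
```

Because every entry of M is strictly positive, the surfer can reach any page from any page, which is what repairs disconnected components and guarantees a unique stationary vector.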

Why it converges: a Markov (column-stochastic) matrix has the property that every column sums to 1, its largest eigenvalue equals 1, and every other eigenvalue has magnitude less than 1. Since the largest eigenvalue is 1, its powers stay equal to 1, while the powers of all the other eigenvalues shrink towards zero. So repeatedly multiplying by the matrix drives the iterate towards the eigenvector associated with eigenvalue 1, which is the PageRank vector.
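A short sketch of that argument in the eigenbasis, assuming the PageRank matrix M has eigenvectors v_1, ..., v_n with eigenvalues lambda_1 = 1 and |lambda_i| < 1 for i >= 2 (as for the damped matrix):

```latex
x_0 = c_1 v_1 + c_2 v_2 + \dots + c_n v_n
\qquad\Longrightarrow\qquad
M^k x_0 = c_1 \lambda_1^k v_1 + \sum_{i=2}^{n} c_i \lambda_i^k v_i
        = c_1 v_1 + \sum_{i=2}^{n} c_i \lambda_i^k v_i
\xrightarrow{\;k \to \infty\;} c_1 v_1 .
```

The iterate therefore settles on (a multiple of) v_1, the eigenvector for eigenvalue 1, which is exactly the steady-state PageRank vector Hx = x discussed above.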

References:
D. Cook and L. Holder, Mining Graph Data, John Wiley & Sons, 2007.
Charu C. Aggarwal and Haixun Wang, Managing and Mining Graph Data, 2010.
Deepayan Chakrabarti and Christos Faloutsos, Graph Mining: Laws, Tools, and Case Studies, Morgan & Claypool, 2012.