UNIT 5 WEB RETRIEVAL AND WEB CRAWLING


About This Presentation

• To understand various search engine system operations.



KONGUNADU COLLEGE OF ENGINEERING AND TECHNOLOGY (Autonomous)
Namakkal-Trichy Main Road, Thottiam
Department of Computer Science and Engineering
CS8080 – Information Retrieval Techniques
UNIT-IV, IV Year / VIII Semester
By K. Karthick, AP/CSE, KNCET

Syllabus

The Web: A Brief History

In 1990, Berners-Lee:
• Wrote the HTTP protocol
• Defined the HTML language
• Wrote the first browser, which he called WorldWideWeb
• Wrote the first Web server

In 1991, he made his browser and server software available on the Internet. The Web was born!

Since its inception, the Web has been a huge success:
• Well over 20 billion pages are now available and accessible on the Web
• More than one fourth of humanity now accesses the Web on a regular basis

How the Web Changed Search

Web search is today the most prominent application of IR and its techniques: the ranking and indexing components of any search engine are fundamentally IR technologies.

The first major impact of the Web on search is related to the characteristics of the document collection itself:
• The Web is composed of pages distributed over millions of sites and connected through hyperlinks
• This requires collecting all documents and storing copies of them in a central repository, prior to indexing
• This new phase in the IR process, introduced by the Web, is called crawling

The second major impact of the Web on search is related to:
• The size of the collection
• The volume of user queries submitted on a daily basis
As a consequence, performance and scalability have become critical characteristics of the IR system.

The third major impact: in a very large collection, predicting relevance is much harder than before. Fortunately, the Web also includes new sources of evidence, e.g., hyperlinks and user clicks on documents in the answer set.

The fourth major impact derives from the fact that the Web is also a medium for doing business:
• The search problem has been extended beyond seeking text information to also encompass other user needs
• Ex: the price of a book, the phone number of a hotel, the link for downloading a piece of software

The fifth major impact of the Web on search is Web spam:
• Web spam: abusive availability of commercial information disguised in the form of informational content
• This difficulty is so large that today we talk of Adversarial Web Retrieval

Web Structure: The Web Graph

We can view the static Web, consisting of static HTML pages together with the hyperlinks between them, as a directed graph in which each web page is a node and each hyperlink a directed edge. We refer to the hyperlinks into a page as in-links and those out of a page as out-links. The number of in-links to a page (also known as its in-degree) has averaged from roughly 8 to 15 in a range of studies. We similarly define the out-degree of a web page to be the number of links out of it. These notions are represented in Figure 4.1.

The Web can be represented as a graph:
• Web page ≡ node
• Hyperlink ≡ directed edge
• # in-links ≡ in-degree of a node
• # out-links ≡ out-degree of a node
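As a minimal sketch of these notions (the page names and links are purely illustrative), the web graph can be stored as adjacency lists of out-links, from which out-degree and in-degree follow directly:

```python
# Minimal sketch: the static Web as a directed graph stored as adjacency lists.
# Page names and links are hypothetical, purely for illustration.
from collections import defaultdict

out_links = {
    "A": ["B", "C"],   # page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

# out-degree = number of out-links of a page
out_degree = {page: len(links) for page, links in out_links.items()}

# in-degree = number of in-links, obtained by inverting the adjacency lists
in_degree = defaultdict(int)
for page, links in out_links.items():
    for target in links:
        in_degree[target] += 1

print(out_degree)        # {'A': 2, 'B': 1, 'C': 1, 'D': 1}
print(dict(in_degree))   # {'B': 1, 'C': 3, 'A': 1}
```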

Search Engine Architectures

A software architecture consists of software components, the interfaces provided by those components, and the relationships between them; it describes a system at a particular level of abstraction.

The architecture of a search engine is determined by two requirements:
• Effectiveness (quality of results)
• Efficiency (response time and throughput)

Each search engine architecture has its own components. The architecture of search engines falls into the following categories:
• Basic architecture (centralized crawler)
• Cluster-based architecture
• Distributed architectures
• Multi-site architecture

Basic Architecture (Centralized Crawler)
• Used by most engines
• Crawlers are software agents that traverse the Web copying pages
• Pages crawled are stored in a central repository and then indexed
• The index is used in a centralized fashion to answer queries
• Most search engines use indices based on the inverted index
• Only a logical, rather than a literal, view of the text needs to be indexed

Normalization operations:
• Removal of punctuation
• Substitution of multiple consecutive spaces by a single space
• Conversion of uppercase to lowercase letters

Some engines eliminate stopwords to reduce index size. The index is complemented with metadata associated with pages (creation date, size, title, etc.). A sketch of these normalization steps and a simple inverted index is shown below.
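The following sketch illustrates the logical view of the text: the normalization operations listed above, followed by construction of a simple inverted index. The stopword list and the sample pages are illustrative assumptions, not part of the original slides:

```python
# Sketch of the logical view of the text: simple normalization followed by
# construction of an inverted index (term -> set of page ids).
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to"}   # illustrative stopword list

def normalize(text):
    text = text.lower()                       # uppercase -> lowercase
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()  # collapse multiple spaces
    return [t for t in text.split() if t not in STOPWORDS]

pages = {
    1: "Crawling the Web: a brief overview.",
    2: "The inverted index is the core of a search engine.",
}

inverted_index = defaultdict(set)
for page_id, text in pages.items():
    for term in normalize(text):
        inverted_index[term].add(page_id)

print(dict(inverted_index))
```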

Given a query:
• The 10 results shown are a subset of the complete result set
• If the user requests more results, the search engine can recompute the query to generate the next 10 results, or obtain them from a partial result set maintained in main memory
• In any case, a search engine never computes the full answer set for the whole Web

The main problems faced by this architecture are the gathering of the data and its sheer volume. The crawler-indexer architecture could not cope with the growth of the Web (by the end of the 1990s). The solution: distribute and parallelize the computation.

Cluster-based Architecture
• Current engines adopt a massively parallel cluster-based architecture
• Document partitioning is used
• Clusters are replicated to handle the overall query load
• Cluster replicas are maintained in various geographical locations to decrease latency (for nearby users)

Many crucial details need to be addressed:
• A good balance between the internal and external activities
• Good load balancing among different clusters
• Fault tolerance at the software level to protect against hardware failures

A sketch of document partitioning with result merging follows.
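A minimal sketch of document partitioning, assuming each partition returns its local top-k results for a query and a front-end merges them (partition contents and scores are invented for illustration):

```python
# Sketch of document partitioning: the collection is split across clusters,
# each partition answers the query locally, and a front-end merges results.
import heapq

partitions = [
    {"d1": 0.9, "d2": 0.4},   # partition 0: doc -> score for the query
    {"d3": 0.7, "d4": 0.2},   # partition 1
    {"d5": 0.8},              # partition 2 (replicas would hold copies of these)
]

def search_partition(partition, k):
    # each partition returns its local top-k results
    return heapq.nlargest(k, partition.items(), key=lambda x: x[1])

def merged_top_k(partitions, k):
    candidates = []
    for p in partitions:
        candidates.extend(search_partition(p, k))
    return heapq.nlargest(k, candidates, key=lambda x: x[1])

print(merged_top_k(partitions, 3))   # [('d1', 0.9), ('d5', 0.8), ('d3', 0.7)]
```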

Caching

Search engines need to be fast; whenever possible, tasks are executed in main memory. Caching is highly recommended and extensively used:
• Provides a shorter average response time
• Significantly reduces the workload on back-end servers
• Decreases the overall amount of bandwidth utilized
In the Web, caching can be done both at the client and at the server side.

Caching of Answers
• The most effective caching technique in search engines
• The query distribution follows a power law, so a small cache can answer a large percentage of queries
• With a 30% hit rate, the capacity of the search engine increases by almost 43%
• Still, in any time window a large fraction of queries will be unique, hence those queries will not be in the cache (50% in Baeza-Yates et al.)
• This can be improved by also caching inverted lists at the search-cluster level
A minimal sketch of an answer cache is shown below.
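A minimal sketch of answer caching, here using an LRU eviction policy (the capacity, the back-end call, and the queries are illustrative assumptions; real engines use more elaborate admission and eviction policies):

```python
# Sketch of answer caching with an LRU policy: repeated (head-of-power-law)
# queries are served from memory, unique tail queries fall through to the
# back-end clusters.
from collections import OrderedDict

class QueryResultCache:
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.cache = OrderedDict()   # query -> cached result page

    def get(self, query, compute_results):
        if query in self.cache:
            self.cache.move_to_end(query)    # mark as recently used
            return self.cache[query]         # cache hit
        results = compute_results(query)     # cache miss: ask the back-end
        self.cache[query] = results
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry
        return results

cache = QueryResultCache(capacity=2)
backend = lambda q: f"results for '{q}'"    # stand-in for the back-end clusters
cache.get("web crawling", backend)           # miss: computed and stored
cache.get("web crawling", backend)           # hit: served from memory
```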

Distributed Architectures

There exist several variants of the crawler-indexer architecture. We describe here the most important ones:
• The most significant early example is Harvest
• Among newer proposals, we distinguish the multi-site architecture proposed by Baeza-Yates et al.

Harvest
Harvest uses a distributed architecture to gather and distribute data. Interestingly, it does not suffer from some of the common problems of crawler-indexer architectures, such as:
• Increased server load caused by the reception of simultaneous requests from different crawlers
• Increased Web traffic, due to crawlers retrieving entire objects while most of the content is eventually not retained
• Lack of coordination between engines, as information is gathered independently by each crawler

To avoid these issues, two components are introduced: gatherers and brokers.
• Gatherer: collects and extracts indexing information from one or more Web servers
• Broker: provides the indexing mechanism and the query interface to the data gathered

Multi-site Architecture

As the document collection grows, the capacity of the query processors has to grow as well. It is unlikely that the growth in size of single processors can match the growth of very large collections, even if a large number of servers is used. The main reasons are physical constraints, such as the size of a single data center and power and cooling requirements.

Distributed resolution of queries using different query processors is a viable approach:
• It enables a more scalable solution, but also imposes new challenges
• One such challenge is the routing of queries to appropriate query processors, to utilize the available resources more efficiently and provide more precise results
• Factors affecting query routing include geographical proximity, query topic, or the language of the query

Geographical proximity: reduce network latency by using resources close to the user posing the query.
• A possible implementation is DNS redirection: according to the IP address of the client, the DNS service routes the query to an appropriate Web server, usually the closest in terms of network distance
• As another example, the DNS service can use the geographical location to determine where to route queries
• The volume of queries submitted from a particular geographic region fluctuates during the day, so it is possible to offload a server in a busy region by rerouting some queries to query servers in a less busy region
A sketch of such region-based routing appears below.
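A toy sketch of query routing by geographical proximity with load shedding; the regions, sites, and load figures are invented for illustration:

```python
# Sketch of query routing by geographical proximity: a client's region is
# mapped to the closest data center; an overloaded site sheds load to another.
DATACENTERS = {"us-east": 0.95, "eu-west": 0.40, "asia-south": 0.55}  # current load
NEAREST = {"US": "us-east", "EU": "eu-west", "IN": "asia-south"}      # region -> closest site

def route_query(client_region, load_threshold=0.9):
    site = NEAREST[client_region]
    if DATACENTERS[site] > load_threshold:
        # reroute to the least loaded site in a less busy region
        site = min(DATACENTERS, key=DATACENTERS.get)
    return site

print(route_query("US"))   # 'eu-west' (us-east is overloaded in this toy example)
print(route_query("IN"))   # 'asia-south'
```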

Search Engine Ranking

Ranking is the hardest and most important function of a search engine.

A key first challenge: devise an adequate process for evaluating the ranking, in terms of the relevance of the results to the user.
• Without such evaluation, it is close to impossible to fine-tune the ranking function
• Without fine-tuning the ranking, there is no state-of-the-art engine; this is an empirical field of science

A second critical challenge: identification of quality content on the Web.

Evidence of quality can be indicated by several signals, such as:
• Domain names
• Text content
• Links (like PageRank)
• Web page access patterns, as monitored by the search engine
Additional useful signals are provided by the layout of the Web page, its title, metadata, font sizes, etc.

A third critical challenge: avoiding, preventing, and managing Web spam.
• Spammers are malicious users who try to trick search engines by artificially inflating signals used for ranking
• This is a consequence of the economic incentives of the current advertising model adopted by search engines

A fourth major challenge: defining the ranking function and computing it.

Link-based Ranking

The number of hyperlinks that point to a page provides a measure of its popularity and quality. Many links in common among pages are indicative of page relations, with potential value for ranking purposes. Examples of ranking techniques that exploit links are discussed next.

Early Algorithms

Early approaches used incoming links for ranking Web pages, but just counting links was not a very reliable measure of authoritativeness. Yuwono and Lee proposed three early ranking algorithms (in addition to the classic TF-IDF vector ranking):
• Boolean spread: given a page p in the result set, extend the result set with the pages that point to and are pointed to by page p
• Vector spread: given a page p in the result set, extend the result set with the pages that point to and are pointed to by page p
• Most-cited: a page p is assigned a score given by the sum of the number of query words contained in other pages that point to page p
WebQuery is an early algorithm that allows visual browsing of Web pages. A sketch of the most-cited scoring idea follows.
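A small sketch of the most-cited idea, assuming a toy link structure and page texts (all invented for illustration):

```python
# Sketch of "most-cited" scoring: page p is scored by summing, over the pages
# that point to p, the number of query words each of those pages contains.
pages = {
    "A": "web crawling and indexing",
    "B": "ranking web pages",
    "C": "cooking recipes",
}
in_links = {"P": ["A", "B", "C"]}   # pages A, B, C all point to page P

def most_cited_score(page, query_words):
    score = 0
    for citing in in_links.get(page, []):
        text = pages[citing].split()
        score += sum(1 for w in query_words if w in text)
    return score

print(most_cited_score("P", ["web", "ranking"]))   # 3
```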

PageRank

The basic idea is that good pages point to good pages. PageRank is the link-analysis method used by search engines that ranks web pages by their link structure:
• PageRank assigns each node in the web graph a numerical score between 0 and 1, known as its PageRank
• The PageRank of a node depends on the link structure of the web graph
• PageRank is a scoring measure based only on the link structure of web pages
• Every node in the web graph is given a score between 0 and 1, depending on its in-links and out-links
• Given a query, a search engine combines the PageRank score with other values to compute the ranked retrieval (e.g., cosine similarity, relevance feedback, etc.)
Let p and r denote pages and L(r) the set of out-links of page r. In its standard formulation, PR(p) = q/N + (1 − q) · Σ PR(r)/|L(r)|, where the sum runs over the pages r that link to p, N is the number of pages, and q is the teleportation probability.

"The PageRank Citation Ranking: Bringing Order to the Web"

Why is page importance rating important? It addresses new challenges for information retrieval on the World Wide Web:
• Huge number of web pages: 150 million by 1998, 1000 billion by 2008
• Diversity of web pages: different topics, different quality, etc.

What is PageRank? A method for rating the importance of web pages objectively and mechanically, using the link structure of the web.

The History of PageRank

PageRank was developed by Larry Page (hence the name PageRank) and Sergey Brin. It was first developed as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998. Shortly after, Page and Brin founded Google. 16 billion…

There have been reports that PageRank would be discontinued by Google. There are large numbers of Search Engine Optimization (SEO) practitioners, and SEO uses various tricks to make a web page appear more important under the PageRank rating.

150 million web pages, 1.7 billion links.

Backlinks and forward links:
• A and B are C's backlinks
• C is A's and B's forward link

Underlying idea: compute a score reflecting the "visitability" of a page.
• When surfing the web, a user may visit some pages more often than others (e.g., pages with more in-links)
• Pages that are often visited are more likely to contain relevant information
• When a dead-end page is reached, the user may teleport (e.g., type an address in the browser)
A sketch of the resulting power-iteration computation with teleportation is shown below.
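A minimal sketch of computing PageRank by power iteration with teleportation (damping factor d); the tiny graph is an illustrative assumption:

```python
# Sketch of PageRank by power iteration with teleportation (damping factor d).
def pagerank(out_links, d=0.85, iterations=50):
    # collect every node that appears as a source or a target
    pages = set(out_links) | {q for links in out_links.values() for q in links}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}                # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1 - d) / n for p in pages}    # teleportation term
        for p in pages:
            links = out_links.get(p, [])
            if links:
                share = d * rank[p] / len(links)
                for q in links:
                    new_rank[q] += share              # p passes its score to the pages it links to
            else:
                for q in pages:                       # dead end: spread the score over all pages
                    new_rank[q] += d * rank[p] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))   # scores sum to 1; C gets the highest score
```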

PageRank: Strengths and Weaknesses

Strengths:
• Elegant theoretical foundation with nice interpretations (e.g., stochastic matrices, principal eigenvectors, stationary state probabilities in the random-walk model / Markov-chain process)
• Can be extended to topic-based and personalized models
• Static measure (i.e., can be precomputed)
• Relatively straightforward to scale
• PageRank scores are relatively stable to graph perturbations

Weaknesses:
• The rank (i.e., the induced ordering of pages) is not stable to perturbations

HITS

Hyperlink-Induced Topic Search (HITS) is a link-analysis method developed by Jon Kleinberg in 1999, using hub and authority scores.
• It considers the set of pages S that point to or are pointed to by pages in the answer set
• Pages that have many links pointing to them are called authorities
• Pages that have many outgoing links are called hubs
• There is positive two-way feedback: better authority pages receive incoming edges from good hubs, and better hub pages have outgoing edges to good authorities

Start with a "root set" of pages relevant to a topic (or information need). Create a "base subgraph" by adding to the root set:
• All pages that have incoming links from pages in the root set
• For each page in the root set: up to 𝑘 pages pointing to it
Then compute authority and hub scores for all pages in the base subgraph, and rank pages by decreasing authority scores.

Let H(p) be the hub value of page p and A(p) the authority value of page p. H(p) and A(p) are defined such that each reinforces the other:
• A(p) = Σ H(q), summed over all pages q that point to p
• H(p) = Σ A(q), summed over all pages q that p points to
The scores are typically normalized after each update. A small iterative sketch is given below.
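A small iterative sketch of the HITS updates on a toy base subgraph (the links are invented for illustration); hub and authority scores are recomputed and normalized for a fixed number of rounds:

```python
# Sketch of the HITS mutual update: authority scores come from the hubs that
# point to a page, hub scores from the authorities a page points to.
from math import sqrt

def hits(out_links, iterations=20):
    pages = set(out_links) | {q for links in out_links.values() for q in links}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(q) over pages q that point to p
        auth = {p: sum(hub[q] for q in pages if p in out_links.get(q, [])) for p in pages}
        # H(p) = sum of A(q) over pages q that p points to
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}
        # normalize so the scores stay bounded
        for scores in (auth, hub):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

graph = {"H1": ["A1", "A2"], "H2": ["A1"], "A1": [], "A2": []}
hub, auth = hits(graph)
print(auth)   # A1 is the strongest authority, H1 the strongest hub
```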

Simple Ranking Functions
• Use a global ranking function such as PageRank. In that case, the quality of a Web page in the result set is independent of the query; the query only selects the pages to be ranked
• Use a linear combination of different ranking signals; for instance, combine BM25 (text-based ranking) with PageRank (link-based ranking)

To illustrate, consider the pages p that satisfy a query Q. The rank score R(p, Q) of page p with regard to query Q can be computed as

R(p, Q) = λ · BM25(p, Q) + (1 − λ) · PR(p)

where λ (0 ≤ λ ≤ 1) weighs the text-based score against the link-based score.
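A tiny sketch of this linear combination (the input scores and the weight are illustrative values; in practice the two signals would also need to be normalized to comparable scales):

```python
# Sketch of combining a text-based score (BM25) with a link-based score
# (PageRank) into a single rank score; all values here are illustrative.
def rank_score(bm25_score, pagerank_score, lam=0.7):
    # R(p, Q) = lam * BM25(p, Q) + (1 - lam) * PR(p)
    return lam * bm25_score + (1 - lam) * pagerank_score

print(rank_score(bm25_score=0.8, pagerank_score=0.3))   # 0.65
```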