Search Engines

1,579 views 31 slides Jul 02, 2019
Slide 1
Slide 1 of 31
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25
Slide 26
26
Slide 27
27
Slide 28
28
Slide 29
29
Slide 30
30
Slide 31
31

About This Presentation

Gives the basic knowledge about search engines and how they work


Slide Content

7/3/2019 Compiled by: Kamal Acharya 1

7/3/2019 Compiled by: Kamal Acharya 2 Introduction Most libraries have a relatively small collection of documents and a catalogue to search documents. The web on the other hand is a very large collection of documents. The search engines allow a user to carry out the task of searching the web for information. The Google dominates the search engine market(about 75%).

7/3/2019 Compiled by: Kamal Acharya 3 Differences between web search and information retrieval A web search is very different than a normal information retrieval search of a document because of the following factors: Bulk: web is much larger than any set of documents used by IR. Diversity: web pages may contains text, tables, image, video, audio instead of only text. Growth: exponential growth of web.

7/3/2019 Compiled by: Kamal Acharya 4 Contd.. Dynamic : web changes significantly with time but text does not. Demanding Users : users demand immediate results. Quality of document: Text documents are usually of high quality but web documents may not. Hyperlinks: very important components of web documents Queries: web queries are short and ambiguous.

7/3/2019 Compiled by: Kamal Acharya 5 Search engine A search engine is defined as program that searches for documents for specified keyword and returns a list of the documents where the keywords are found. A search engine consists of the following main components: Crawler(spider) Indexer Search engine user interface

7/3/2019 Compiled by: Kamal Acharya 6 Contd… A typical search engine architecture is as shown in figure below

7/3/2019 Compiled by: Kamal Acharya 7 Contd.. How search engine works? A search engine operates in the following order: 1. Web crawling 2. Indexing 3. Searching

7/3/2019 Compiled by: Kamal Acharya 8 Contd.. Web Crawling: Search engine has a huge databases of web pages . Such databases are built and updated automatically by the web crawler. The web crawler performs web crawling as follows: The crawler begins with one or more URLs that constitute a URL set. It picks a URL from this URL set, and then fetches the web page at that URL. The fetched page is then parsed to extract both the text and the links from the page. The extracted links (URLs) are then added to a URL set. The extracted text is fed to a text indexer.

7/3/2019 Compiled by: Kamal Acharya 9 Contd.. Indexing: The indexer module of the search engine is responsible for indexing the extracted text supplied by the web crawler. Most commonly used indexing is the inverted indexing

7/3/2019 Compiled by: Kamal Acharya 10 Contd..

7/3/2019 Compiled by: Kamal Acharya 11 Contd.. Searching: When a user enters a query to the search engine, user is not searching the entire web. Instead user is only searching the database that has been compiled by the search engine. The user’s query is parsed into the words by the query parser. Such parsed words are matched with the words in the inverted list of indexed documents. The matched list of documents are returned to the user with ranking.

7/3/2019 Compiled by: Kamal Acharya 12 Characteristics of search engines Features a search engine must provide: Robustness: search engine must be distributed over large number of machine to deal search engine failure due to the machine failure. Politeness: Web servers have policies regulating the rate at which a search engine can visit them. These politeness policies must be respected.

7/3/2019 Compiled by: Kamal Acharya 13 Contd.. Features a search engine should provide Distributed: The search should have the ability to execute in a distributed fashion across multiple machines. Scalable: The search engine architecture should permit scaling up the search rate by adding extra machines .

7/3/2019 Compiled by: Kamal Acharya 14 Contd.. Performance and efficiency : The search system should make efficient use of various system resources including processor, storage and network. Quality: Given that a significant fraction of all web pages are of poor utility for serving user query needs, the search engine should be biased towards fetching “useful” pages first.

7/3/2019 Compiled by: Kamal Acharya 15 Contd.. Freshness: it should obtain fresh copies of previously fetched pages. Extensible: Crawlers should be designed to be extensible in many ways – to cope with new data formats, new fetch protocols, and so on. This demands that the crawler architecture be modular.

7/3/2019 Compiled by: Kamal Acharya 16 Problems with search using search engines Specifying query keywords can be challenging: Search result get affected by structure of the query phrase. Due to the nature of English language search result may get affected e.g., current.

7/3/2019 Compiled by: Kamal Acharya 17 Contd.. Difficult for the search engine to be certain about what users want. Some may be seeking destination While others may want only a small number of highly relevant result. Diversity of search engine and web users Young to old A search engine is therefore attempting to meet the needs of a diverse group of users.

7/3/2019 Compiled by: Kamal Acharya 18 The goals of web search Depending on the nature of search engine queries, the information needs of user may be divided into three classes: Navigational Informational Transactional

7/3/2019 Compiled by: Kamal Acharya 19 Contd.. Navigational: To reach a website that the user has in mind. The user may know the site exists but or may have visited the site earlier but does not know the site URL. Informational: To find a website that provides useful information about a topic of interest. The user may not have a particular website in mind. Transactional: To go to a site to perform some kind of transaction. E.g., buy a book

7/3/2019 Compiled by: Kamal Acharya 20 Quality of search result The quality of search results from a search engine ideally should satisfy the following requirements: Precision: precision indicates what percentage of documents retrieved are relevant? So , only relevant documents should be returned. Recall: means what percentage of relevant documents is retrieved from total relevant documents in the web So, all relevant document should be returned

7/3/2019 Compiled by: Kamal Acharya 21 Contd.. Ranking: A ranking of the documents providing some indication of the relative relevance of the results should be returned. First screen : The first page of results should include the most relevant results. Speed: Results should be provided quickly.

7/3/2019 Compiled by: Kamal Acharya 22 Search engine functionality A search engine is a complex collection of software modules. A search engines carries out a variety of tasks: Collecting information Evaluating and categorizing information Creating a database and creating indexes Computing ranks of the web documents Checking queries and executing them Presenting results Profiling the users

7/3/2019 Compiled by: Kamal Acharya 23 Contd.. Collecting information: A search engine would normally collect web pages or information about them by web crawling. Evaluating and categorizing information: search engine evaluates the pages before submission and categorize the information. Creating a database and creating indexes: The information collected needs to be stored either in a database or some kind of file system. Indexes must be created so that the information may be searched efficiently.

7/3/2019 Compiled by: Kamal Acharya 24 Contd.. Computing ranks of the web documents: rank the web pages before returning as a response to the user queries. Checking queries and executing them: queries posed by users need to be checked , for example, for spelling errors. Once checked, a query is executed by searching the search engine database. Presenting results: search engine determine what results to present and how to display them Profiling the user: To improve the search performance the search engines carry out user profiling that deals with the way users use search engines.

7/3/2019 Compiled by: Kamal Acharya 25 Page Ranking The web consists of a huge number of documents that have been published without any quality control. The page ranking is a method for determining the relative importance and quality of the page for a given query. The most well known ranking algorithm is the page rank algorithm.

7/3/2019 Compiled by: Kamal Acharya 26 Contd.. Page rank algorithm: Was developed by Larry page at Stanford university. A hyperlink to a page counts as a vote of support . A page that is linked to by many pages receives a high rank and if there is no links to a web page there is no support for that page so, get low rank.

7/3/2019 Compiled by: Kamal Acharya 27 Contd.. Assigns to every node in the web graph a numerical score between 0 and 1 to each element of hyperlinked set of documents. The rank value indicates the importance of a particular page. A page rank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with 0.5 page rank.

7/3/2019 Compiled by: Kamal Acharya 28 Contd.. Algorithm with illustrative example: Assume a small universe of four web pages A, B, C and D. The initial approximation of Page Rank would be evenly divided between the four documents. Hence each document would begin with an estimated Page Rank of 0.25. If pages B, C and D each only link to A, they would each confer 0.25 page rank to A. i.e. PR(A) = PR(B) + PR(C) + PR(D) = 0.75

7/3/2019 Compiled by: Kamal Acharya 29 Contd.. Suppose that page B has link to page C as well as to page A, while pages D has links to all three pages and page C has link to A. The value of link votes is divided among all the outbound links on the page. Thus B gives vote worth 0.125 to page A and a vote 0.125 to page C. Similarly, D’s page rank is 0.083 (approximately) i.e. PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3 Where L(page) = number of outbound links B m = set of all pages link to page m

7/3/2019 Compiled by: Kamal Acharya 30 Home work What is search engine? Explain the various components of search engine architecture. What is the role of crawler and indexer? Explain the different search engine functionality. What are the primary goals of web search? Describe the page rank algorithm. Using an example, show how it works. How is web search different than text retrieval?

Thank You ! Compiled by: Kamal Acharya 31 7/3/2019