The Anatomy of a Large-Scale Hypertextual Web Search Engine

MehulBoricha 137 views 20 slides Feb 17, 2019

Slide 1 of 20

About This Presentation

It explains the basics of Search Engine and How it Works (Google). It covers the following topics:

+ Introduction
+ Challenges
+ Design Goals
+ PageRank
+ Google Architecture Overview
+ Major Data Structures

Size: 352.08 KB

Language: en

Added: Feb 17, 2019

Slides: 20 pages

Slide Content

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Introduction D esigned to crawl and index the Web efficiently P roduce much more satisfying search results V ery little a cademic r esearch

Introduction W eb is growing rapidly Also, new users inexperienced in the art of web research Human maintained lists cover popular topics are efficient but expensive

Introduction Automated search engines that rely on keyword matching R eturn too many low quality matches A dvertisers attempt to mislead automated search engines

Challenges Fast crawling technology is needed Storage space must be used efficiently The indexing system must process hundreds of gigabytes of data efficiently Queries must be handled quickly

Design Goals Improved Search Quality - We need tools that have very high precision Academic Search Engine Research - Build systems that reasonable numbers of people can actually use

System Features: PageRank Use of Maps PageRank to prioritize web search result The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored & request another random page.

System Features: PageRank We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.

Google Architecture Overview

Major Data Structures BigFiles - Virtual files spanning multiple file systems - Addressable by 64 bit integers - Multiple file systems is handled automatically - Also support rudimentary compression options

Major Data Structures Repository - Contains the full HTML of every web page. - Each page is compressed using zlib (zlib vs bzip) - The documents are stored one after the other - Prefixed by docID, length, and URL

Major Data Structures Repository

Major Data Structures Document Index - Document index: information about each document. - Fixed width ISAM (Index sequential access mode) index, ordered by docID. - Each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. - If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title.

Major Data Structures Document Index - Otherwise the pointer points into the URLlist which contains just the URL. - This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search

Major Data Structures Lexicon - It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers.

Major Data Structures Hit Lists - A list of occurrences of a particular word in a particular document including position, font, and capitalization information. - Account for most of the space used in both the forward and the inverted indices.

Major Data Structures Hit Lists - Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. - Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document

Major Data Structures Forward Index - Each barrel holds a range of wordID’s. - If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID’s with hitlists which correspond to those words. - Requires slightly more storage because of duplicated docIDs.

Major Data Structures Inverted Index - C onsists of the same barrels as the forward index, except that they have been processed by the sorter. - For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. - It points to a doclist of docID’s together with their corresponding hit lists. - This doclist represents all the occurrences of that word in all documents.

The Anatomy of a Large-Scale Hypertextual Web Search Engine

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

The Anatomy of a Large-Scale Hypertextual Web Search Engine

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

MGV Residential Design projects for different clients, including a New Mexico Adobe project-1-.pdf

EUNITED_Advocacy and Public Engagement through Visual Media

DESIGN THINKINGGG PPT 2 TOPIC IDEATION.pptx

DESIGN THINKING CHAPTER 1 PPTT PPT 1.pptx

Hinduism and Its History - PowerPoint Slides.pptx

Service Attributes of Manufactured Parts.pptx