Webinar Outline
- Text preprocessing using Python
- Working of the Google Search Engine
- Comparison of search engines: Google vs Bing
18/06/2024, BITS Pilani, Pilani Campus
Text preprocessing
Text preprocessing using Python

Let us understand how text preprocessing is implemented in Python. We will use the NLTK (Natural Language Toolkit) library. NLTK is one of the largest Python libraries for Natural Language Processing tasks, including tokenization, removal of stopwords, removal of punctuation and whitespace, and stemming/lemmatization.

# import the necessary libraries
import nltk
import string
import re
Common Text preprocessing steps
- Tokenization: The first step in vocabulary generation. We break each document down into individual words or tokens.
- Stopword removal: Stopwords are common words like "the," "is," "and," etc. that don't carry significant meaning for information retrieval. We typically remove stopwords to reduce noise in the index.
- Stemming: A technique used to extract the base form of a word by removing affixes (prefixes and suffixes).
- Lemmatization: Considers the context and converts the word to its meaningful base form, which is called the lemma.
Tokenization

Code to tokenize text into words:

import nltk
from nltk import word_tokenize

sent = "I study Machine Learning. Current course is Information Retrieval."
print(word_tokenize(sent))

Output:
['I', 'study', 'Machine', 'Learning', '.', 'Current', 'course', 'is', 'Information', 'Retrieval', '.']
Stop words

Stopwords are words that do not contribute to the meaning of a sentence, so they can safely be removed without changing the sentence's meaning. The NLTK library provides a set of default stopwords; we can use it to remove stopwords from our text and return a list of word tokens.
List of default Stop words
Stopwords

Code to remove stopwords:
Stemming

Stemming is the process of getting the root form of a word. The stem or root is the part to which inflectional affixes (-ed, -ize, -de, -s, etc.) are added. The stem is created by removing the prefix or suffix of a word, so stemming a word may not result in an actual word. There are three main stemming algorithms: the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer. The Porter Stemmer is the most common among them.
Stemming

Code to implement stemming using PorterStemmer:

from nltk.stem import PorterStemmer
from nltk import word_tokenize

stemmer = PorterStemmer()

def porter_stemmer(text):
    tokens = word_tokenize(text)
    for index in range(len(tokens)):
        # stem each word
        stem_word = stemmer.stem(tokens[index])
        # update tokens list with the stemmed word
        tokens[index] = stem_word
    # join the list into a space-separated string
    return ' '.join(tokens)

ex_stem = "Programmers program with programming languages"
stem_result = porter_stemmer(ex_stem)
print(f"Result after stemming technique:\n{stem_result}")

## Output: programm program with program languag
Stemming

Some common rules of Snowball stemming:

ILY ---> ILI
LY  ---> Nil
SS  ---> SS
S   ---> Nil
ED  ---> E, Nil

Word         Stem
cared        care
university   univers
fairly       fair
easily       easili
singing      sing
sings        sing
sung         sung
singer       singer
sportingly   sport
Stemming

Code to implement Snowball stemming:

from nltk.stem.snowball import SnowballStemmer

# the stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')

# list of tokenized words
words = ['cared', 'university', 'fairly', 'easily', 'singing',
         'sings', 'sung', 'singer', 'sportingly']

# stem of each word
stem_words = []
for w in words:
    x = snow_stemmer.stem(w)
    stem_words.append(x)

# print stemming results
for e1, e2 in zip(words, stem_words):
    print(e1 + ' ----> ' + e2)
Lemmatization

Lemmatization also converts a word to its root form. The difference is that lemmatization ensures the root word belongs to the language, so we get valid words. WordNet is a popular lexical database of the English language that NLTK uses internally; WordNetLemmatizer is an NLTK lemmatizer built on the WordNet database and is quite widely used. We also need to provide context for the lemmatization, so we pass the part of speech as a parameter explicitly; otherwise the lemmatizer assumes the POS is a noun by default, and lemmatization will not give the right results.
Lemmatization

Code to implement lemmatization:

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

lemma = WordNetLemmatizer()

def lemmatization(text):
    tokens = word_tokenize(text)
    for index in range(len(tokens)):
        # lemmatize each word (default POS is noun)
        lemma_word = lemma.lemmatize(tokens[index])
        tokens[index] = lemma_word
    return ' '.join(tokens)

ex_lemma = "Programmers program with programming languages"
lemma_result = lemmatization(ex_lemma)
print(f"Result of lemmatization\n{lemma_result}")

## Output: Programmers program with programming language
Lemmatization

Code to implement lemmatization with POS tags:

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()
for token, tag in pos_tag(word_tokenize(text)):
    pos = tag[0].lower()
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'
    print(token, "--->", wordnet.lemmatize(token, pos))

Output:
She ---> She
jumped ---> jump
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathe
heavily ---> heavily
Inverted Index

Code to preprocess text and create a simple inverted index:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from collections import defaultdict
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample documents
documents = {
    1: "This is a sample document.",
    2: "This document is another sample document.",
    3: "And this is a different document."
}
# contd ...
Inverted Index

# Initialize stop words, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to preprocess and tokenize text
def preprocess(text, use_stemming=True):
    # Normalize text
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Stem or lemmatize tokens
    if use_stemming:
        tokens = [stemmer.stem(word) for word in tokens]
    else:
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens
# contd ...
Inverted Index

# Create the inverted index with document frequency
inverted_index = defaultdict(list)
doc_freq = defaultdict(int)

for doc_id, text in documents.items():
    tokens = preprocess(text)
    unique_tokens = set(tokens)
    for token in unique_tokens:
        doc_freq[token] += 1
        inverted_index[token].append(doc_id)

# Display the inverted index with document frequency
print("Inverted Index with Document Frequency:")
for word, doc_ids in inverted_index.items():
    print(f"{word}: {doc_ids} (Document Frequency: {doc_freq[word]})")
Google Search Engine

- The Google Search engine is the most trusted and the most widely used search engine.
- It was created by Larry Page and Sergey Brin.
- Google processes over 8.5 billion searches per day.
- Google accounts for 91.54% of the global search engine market.
Stages of Google Search
1. Crawling

- Crawling is the process by which Google discovers new or updated web pages.
- Googlebot: Google's web crawler, a piece of software designed to explore the web. It fetches web pages and follows the links on those pages to find new URLs.
- Starting points: Googlebot begins its crawl from a list of known web addresses from past crawls and from sitemaps provided by site owners.
- New content discovery: As Googlebot visits each page, it detects links on the page and adds them to its list of pages to visit next.
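The discovery loop above (fetch a page, extract its links, queue unseen URLs) can be sketched with a toy in-memory "web"; the page names and links here are hypothetical stand-ins for real fetched HTML:

```python
from collections import deque

# Toy web: page URL -> list of outgoing links found on that page.
pages = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],
}

def crawl(seeds, pages):
    """Breadth-first crawl: start from seed URLs, follow links, skip seen pages."""
    frontier = deque(seeds)   # pages to visit next
    visited = []              # crawl order
    seen = set(seeds)
    while frontier:
        url = frontier.popleft()
        visited.append(url)                 # "fetch" the page
        for link in pages.get(url, []):     # discover links on the page
            if link not in seen:            # only queue newly discovered URLs
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["a.com"], pages))  # ['a.com', 'b.com', 'c.com', 'd.com']
```

A real crawler adds politeness delays, robots.txt handling, and re-crawl scheduling on top of this basic frontier loop.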
2. Indexing

- Indexing: Once Googlebot crawls a page, the search engine decides whether to add it to its index, a vast database stored across thousands of machines.
- Content processing: Google analyzes the content of the page, catalogs the images and videos embedded in it, and determines the topics covered on the page.
- Key signals: Beyond content, Google checks for key signals like freshness of content, region-specific relevance, and website quality to determine the value of the page.
- Duplication checks: To avoid storing duplicate information, the search engine checks whether the content already exists in its database.
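One simple way to implement a duplication check, sketched here as an illustration (the URLs and texts are hypothetical; production systems use near-duplicate detection rather than exact hashing):

```python
import hashlib

def fingerprint(text):
    """Fingerprint a page by hashing its normalized text content."""
    normalized = " ".join(text.lower().split())  # collapse case and whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

index = {}  # fingerprint -> first URL seen with that content

def add_if_new(url, text):
    """Store the page only if its content has not been seen before."""
    fp = fingerprint(text)
    if fp in index:
        return False      # duplicate: already stored under index[fp]
    index[fp] = url
    return True

print(add_if_new("a.com/page", "Hello   World"))  # True  (new content)
print(add_if_new("b.com/copy", "hello world"))    # False (duplicate after normalization)
```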
3. Ranking

Discovering, crawling, and indexing content make up only the first part of the puzzle. Search engines also need a way to rank matching results when a user performs a search; this is the job of search algorithms. PageRank is a Google algorithm that measures the importance of web pages based on the quality and quantity of links pointing to them. These algorithms take hundreds of factors into consideration, including user-specific details like location and search history.
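The classic formulation of PageRank can be computed by power iteration over the link graph; a minimal sketch on a hypothetical three-page web (real rankings combine PageRank with many other signals):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}            # start from a uniform distribution
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / n for p in nodes}  # teleport term
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:                    # each page passes rank to its targets
                    new_rank[q] += share
            else:                                 # dangling page: spread rank evenly
                for q in nodes:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical web: both a and c link to b, so b should rank highest.
toy = {"a": ["b"], "b": ["c"], "c": ["b"]}
ranks = pagerank(toy)
print(max(ranks, key=ranks.get))  # b
```

The damping factor (0.85 here, the value used in the original PageRank paper) models a surfer who occasionally jumps to a random page instead of following a link.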
Comparison of Search Engines
Google and Bing are two of the world's most popular search engines. Google has dominated search for decades, but since Microsoft integrated AI from OpenAI's ChatGPT, Microsoft Bing has seen dramatic growth. So, Google vs. Bing: which search engine is better, and how do they differ in terms of user experience and SEO?
Category: History
- Google: Founded in 1998 by Larry Page and Sergey Brin.
- Bing: Microsoft introduced Bing in 2009, replacing its previous search engine, Live Search.

Category: Market Share (as of 2023)
- Google: Retains an 83.49% share of the global market, although this has fallen from 89.95% over the past three years.
- Bing: During the same timeframe, Bing's share has risen from 6.43% to 9.19%.

Category: Ranking Algorithms for Search
- Google: Google's search algorithm, PageRank, was the first to consider a webpage's backlinks to assess its quality and deliver highly relevant search results.
- Bing: Bing's 'Whole Page Algorithm' is a holistic approach to ranking websites and web pages: rather than focusing on individual elements like keywords or backlinks, it looks at the entire content of a page and its relevance to the user's query.
Category: Index Size
- Google: Claims to have hundreds of billions of web pages in its index, more than 100,000,000 gigabytes. Google updates its index more frequently than Bing, which means it adds new pages faster.
- Bing: Index size is estimated at 8 to 14 billion web pages. Bing updates its index less frequently than Google, so it may not have as much recent or new information.

Category: AI Algorithms for Search
- Google: Introduced its first AI-based search algorithm, RankBrain, in 2015; RankBrain helps Google understand and rank content. Google recently launched a generative artificial intelligence chatbot, Bard.
- Bing: Has integrated OpenAI's GPT-4, a powerful natural language generation (NLG) model, into its search engine.