Webinar Outline
- Text preprocessing using Python
- Working of the Google Search Engine
- Comparison of search engines: Google vs Bing
18/06/2024, BITS Pilani, Pilani Campus
Text preprocessing
Text preprocessing using Python

Let us understand how text preprocessing is implemented in Python. We will use the NLTK (Natural Language Toolkit) library. NLTK is one of the largest Python libraries for Natural Language Processing tasks, including tokenization, removal of stopwords, removal of punctuation and whitespace, and stemming/lemmatization.

# import the necessary libraries
import nltk
import string
import re
Common Text preprocessing steps
- Tokenization: The first step in vocabulary generation. We break each document down into individual words or tokens.
- Stopword removal: Stopwords are common words like "the," "is," "and," etc. that don't carry significant meaning for information retrieval. We typically remove stopwords to reduce noise in the index.
- Stemming: A technique used to extract the base form of a word by removing affixes (prefixes and suffixes).
- Lemmatization: Considers the context and converts the word to its meaningful base form, which is called the lemma.
Tokenization

Code to tokenize text into words:

import nltk
from nltk import word_tokenize

sent = "I study Machine Learning. Current course is Information Retrieval."
print(word_tokenize(sent))

Output:
['I', 'study', 'Machine', 'Learning', '.', 'Current', 'course', 'is', 'Information', 'Retrieval', '.']
Stop words

Stopwords are words that do not contribute to the meaning of a sentence, so they can safely be removed without changing the sentence's meaning. The NLTK library provides a set of default stopwords; we can use it to remove stopwords from our text and return a list of word tokens.
List of default Stop words
Stopwords

Code to remove stopwords:
Stemming

Stemming is the process of getting the root form of a word. The stem or root is the part to which inflectional affixes (-ed, -ize, -de, -s, etc.) are added. The stem is created by removing the prefix or suffix of a word, so stemming a word may not result in an actual word. There are three main stemming algorithms: the Porter Stemmer, the Snowball Stemmer, and the Lancaster Stemmer. The Porter Stemmer is the most common among them.
Stemming

Code to implement stemming using PorterStemmer:

from nltk.stem import PorterStemmer
from nltk import word_tokenize

stemmer = PorterStemmer()

def porter_stemmer(text):
    tokens = word_tokenize(text)
    for index in range(len(tokens)):
        # stem each word
        stem_word = stemmer.stem(tokens[index])
        # update tokens list with the stemmed word
        tokens[index] = stem_word
    # join the list into a space-separated string
    return ' '.join(tokens)

ex_stem = "Programmers program with programming languages"
stem_result = porter_stemmer(ex_stem)
print(f"Result after stemming technique:\n{stem_result}")

## Output: programm program with program languag
Stemming

Some common rules of Snowball stemming:

ILY ---> ILI
LY  ---> Nil
SS  ---> SS
S   ---> Nil
ED  ---> E, Nil

Word         Stem
cared        care
university   univers
fairly       fair
easily       easili
singing      sing
sings        sing
sung         sung
singer       singer
sportingly   sport
Stemming

Code to implement Snowball stemming:

from nltk.stem.snowball import SnowballStemmer

# the stemmer requires a language parameter
snow_stemmer = SnowballStemmer(language='english')

# list of tokenized words
words = ['cared', 'university', 'fairly', 'easily', 'singing',
         'sings', 'sung', 'singer', 'sportingly']

# stem of each word
stem_words = []
for w in words:
    x = snow_stemmer.stem(w)
    stem_words.append(x)

# print stemming results
for e1, e2 in zip(words, stem_words):
    print(e1 + ' ----> ' + e2)
Lemmatization

Lemmatization also converts a word to its root form. The difference is that lemmatization ensures the root word belongs to the language, so we get valid words. WordNet is a popular lexical database of the English language that NLTK uses internally; WordNetLemmatizer is an NLTK lemmatizer built on the WordNet database and is quite widely used. We also need to provide context for the lemmatization, so we pass the part of speech as a parameter explicitly; otherwise the lemmatizer assumes the POS is a noun by default, and lemmatization will not give the right results.
Lemmatization

Code to implement lemmatization:

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

lemma = WordNetLemmatizer()

def lemmatization(text):
    tokens = word_tokenize(text)
    for index in range(len(tokens)):
        # lemmatize each word (default POS is noun)
        lemma_word = lemma.lemmatize(tokens[index])
        tokens[index] = lemma_word
    return ' '.join(tokens)

ex_lemma = "Programmers program with programming languages"
lemma_result = lemmatization(ex_lemma)
print(f"Result of lemmatization\n{lemma_result}")

## Output: Programmers program with programming language
Lemmatization

Code to implement lemmatization with POS tags:

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize, pos_tag

text = "She jumped into the river and breathed heavily"
wordnet = WordNetLemmatizer()
for token, tag in pos_tag(word_tokenize(text)):
    pos = tag[0].lower()
    if pos not in ['a', 'r', 'n', 'v']:
        pos = 'n'
    print(token, "--->", wordnet.lemmatize(token, pos))

Output:
She ---> She
jumped ---> jump
into ---> into
the ---> the
river ---> river
and ---> and
breathed ---> breathe
heavily ---> heavily
Inverted Index

Code to preprocess text and create a simple inverted index:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from collections import defaultdict
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Sample documents
documents = {
    1: "This is a sample document.",
    2: "This document is another sample document.",
    3: "And this is a different document."
}
# contd ...
Inverted Index

# Initialize stop words, stemmer, and lemmatizer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Function to preprocess and tokenize text
def preprocess(text, use_stemming=True):
    # Normalize text
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stop words
    tokens = [word for word in tokens if word not in stop_words]
    # Stem or lemmatize tokens
    if use_stemming:
        tokens = [stemmer.stem(word) for word in tokens]
    else:
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens
# contd ...
Inverted Index

# Create the inverted index with document frequency
inverted_index = defaultdict(list)
doc_freq = defaultdict(int)

for doc_id, text in documents.items():
    tokens = preprocess(text)
    unique_tokens = set(tokens)
    for token in unique_tokens:
        doc_freq[token] += 1
        inverted_index[token].append(doc_id)

# Display the inverted index with document frequency
print("Inverted Index with Document Frequency:")
for word, doc_ids in inverted_index.items():
    print(f"{word}: {doc_ids} (Document Frequency: {doc_freq[word]})")
Google Search Engine

- The Google Search engine is the most trusted and the most widely used search engine.
- It was created by Larry Page and Sergey Brin.
- Google processes over 8.5 billion searches per day.
- Google accounts for 91.54% of the global search engine market.
Stages of Google Search
1. Crawling

- Crawling is the process by which Google discovers new or updated web pages.
- Googlebot: Google's web crawler, a piece of software designed to explore the web. It fetches web pages and follows the links on those pages to find new URLs.
- Starting points: Googlebot begins its crawl from a list of known web addresses from past crawls and from sitemaps provided by site owners.
- New content discovery: As Googlebot visits each page, it detects links on the page and adds them to its list of pages to visit next.
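The discovery loop above (fetch a page, extract its links, queue unseen URLs) can be sketched with a toy in-memory "web"; the page names and links here are hypothetical stand-ins for real fetched HTML:

```python
from collections import deque

# Toy web: page URL -> list of outgoing links found on that page.
pages = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com", "d.com"],
    "c.com": [],
    "d.com": ["a.com"],
}

def crawl(seeds, pages):
    """Breadth-first crawl: start from seed URLs, follow links, skip seen pages."""
    frontier = deque(seeds)   # pages to visit next
    visited = []              # crawl order
    seen = set(seeds)
    while frontier:
        url = frontier.popleft()
        visited.append(url)                 # "fetch" the page
        for link in pages.get(url, []):     # discover links on the page
            if link not in seen:            # only queue newly discovered URLs
                seen.add(link)
                frontier.append(link)
    return visited

print(crawl(["a.com"], pages))  # ['a.com', 'b.com', 'c.com', 'd.com']
```

A real crawler adds politeness delays, robots.txt handling, and re-crawl scheduling on top of this basic frontier loop.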
2. Indexing

- Indexing: Once Googlebot crawls a page, the search engine decides whether to add it to its index, a vast database stored across thousands of machines.
- Content processing: Google analyzes the content of the page, catalogs the images and videos embedded in it, and determines the topics covered on the page.
- Key signals: Beyond content, Google checks for key signals like freshness of content, region-specific relevance, and website quality to determine the value of the page.
- Duplication checks: To avoid storing duplicate information, the search engine checks whether the content already exists in its database.
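One simple way to implement a duplication check, sketched here as an illustration (the URLs and texts are hypothetical; production systems use near-duplicate detection rather than exact hashing):

```python
import hashlib

def fingerprint(text):
    """Fingerprint a page by hashing its normalized text content."""
    normalized = " ".join(text.lower().split())  # collapse case and whitespace
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

index = {}  # fingerprint -> first URL seen with that content

def add_if_new(url, text):
    """Store the page only if its content has not been seen before."""
    fp = fingerprint(text)
    if fp in index:
        return False      # duplicate: already stored under index[fp]
    index[fp] = url
    return True

print(add_if_new("a.com/page", "Hello   World"))  # True  (new content)
print(add_if_new("b.com/copy", "hello world"))    # False (duplicate after normalization)
```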
3. Ranking

Discovering, crawling, and indexing content make up only the first part of the puzzle. Search engines also need a way to rank matching results when a user performs a search; this is the job of search algorithms. PageRank is a Google algorithm that measures the importance of web pages based on the quality and quantity of links pointing to them. These algorithms take hundreds of factors into consideration, including user-specific details like location and search history.
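The classic formulation of PageRank can be computed by power iteration over the link graph; a minimal sketch on a hypothetical three-page web (real rankings combine PageRank with many other signals):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over a dict: page -> list of pages it links to."""
    nodes = list(links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}            # start from a uniform distribution
    for _ in range(iters):
        new_rank = {p: (1.0 - damping) / n for p in nodes}  # teleport term
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:                    # each page passes rank to its targets
                    new_rank[q] += share
            else:                                 # dangling page: spread rank evenly
                for q in nodes:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Hypothetical web: both a and c link to b, so b should rank highest.
toy = {"a": ["b"], "b": ["c"], "c": ["b"]}
ranks = pagerank(toy)
print(max(ranks, key=ranks.get))  # b
```

The damping factor (0.85 here, the value used in the original PageRank paper) models a surfer who occasionally jumps to a random page instead of following a link.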
Comparison of Search Engines
Google and Bing are two of the world's most popular search engines. Google has dominated search for decades, but since Microsoft integrated AI from OpenAI's ChatGPT, Microsoft Bing has seen dramatic growth. So, Google vs. Bing: which search engine is better, and how do they differ in terms of user experience and SEO?
Category: History
- Google: Founded in 1998 by Larry Page and Sergey Brin.
- Bing: Microsoft introduced Bing in 2009, replacing its previous search engine, Live Search.

Category: Market Share (as of 2023)
- Google: Retains an 83.49% share of the global market, although this has fallen from 89.95% over the past three years.
- Bing: During the same timeframe, Bing's share has risen from 6.43% to 9.19%.

Category: Ranking Algorithms for Search
- Google: Google's search algorithm, PageRank, was the first to consider a webpage's backlinks to assess its quality and deliver highly relevant search results.
- Bing: Bing's 'Whole Page Algorithm' is a holistic approach to ranking websites and web pages: rather than focusing on individual elements like keywords or backlinks, it looks at the entire content of a page and its relevance to the user's query.
Category: Index Size
- Google: Claims to have hundreds of billions of web pages in its index, more than 100,000,000 gigabytes. Google updates its index more frequently than Bing, which means it adds new pages faster.
- Bing: Index size is estimated at 8 to 14 billion web pages. Bing updates its index less frequently than Google, so it may not have as much recent or new information.

Category: AI Algorithms for Search
- Google: Introduced its first AI-based search algorithm, RankBrain, in 2015; RankBrain helps Google understand and rank content. Google recently launched a generative artificial intelligence chatbot, Bard.
- Bing: Has integrated OpenAI's GPT-4, a powerful natural language generation (NLG) model, into its search engine.