Information Retrieval Systems_Lecture_1_Text_Analytics.pptx

Slide Content

INFORMATION RETRIEVAL SYSTEMS (IRS)

Text mining is the process of exploring and analyzing large amounts of unstructured text data, aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. It is also known as text analytics.

Information Extraction: The automatic extraction of structured data, such as entities, relationships between entities, and attributes describing entities, from an unstructured source is called information extraction.

Natural Language Processing (NLP): NLP is the branch of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.

Some applications of NLP include:
Translation tools: Google Translate, Amazon Translate.
Virtual assistants: Siri, Cortana, Google Home, Alexa, etc. can not only talk to you but also understand commands given to them.

Data mining: Data mining refers to the extraction of useful data and hidden patterns from large data sets. Data mining tools can predict behaviors and future trends that allow businesses to make better data-driven decisions.

Information retrieval: Information retrieval is the process of accessing data resources, usually documents or other unstructured data, for the purpose of sharing knowledge. More specifically, an information retrieval system provides an interface between users and large data repositories, especially textual repositories.


What is text analytics and text analysis?
Text analytics is the quantitative data that you can obtain by analyzing patterns in multiple samples of text. It is presented in charts, tables, or graphs.
Textual analysis is a term used to study and understand texts. It includes exploring the language, symbols, patterns, and pictures in the text.

Text analysis vs. text analytics: Text analytics helps you determine whether there is a particular trend or pattern in the results of analyzing thousands of pieces of feedback. Meanwhile, you can use text analysis to determine whether a customer's feedback is positive or negative.


Text mining/analysis activities or tasks:
Document classification: information retrieval (e.g., search engines), supervised classification (e.g., guessing genres), unsupervised clustering (e.g., discovering alternative "genres").
Corpora comparison: e.g., political speeches. A text corpus is a large and structured set of texts (nowadays usually electronically stored and processed).
Entity recognition/extraction: e.g., in "Mark Zuckerberg is one of the founders of Facebook, a company from the United States" we can identify three types of entities: Person: Mark Zuckerberg; Company: Facebook; Location: United States.
Data visualization.
Word frequency: lists of words and their frequencies.
Collocation: words commonly appearing near each other.
N-grams: common two-, three-, etc. word phrases.
Concordance: the instances and contexts of a given word or set of words.
Detecting clusters: e.g., of document features (i.e., topic modeling).
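
As an illustration of two of these tasks, word frequency and n-grams, here is a minimal Python sketch on a made-up sentence (the text and variable names are hypothetical, not from the slides):

from collections import Counter

text = "information retrieval systems retrieve relevant information"
tokens = text.lower().split()

word_freq = Counter(tokens)                 # word frequency list
bigrams = Counter(zip(tokens, tokens[1:]))  # 2-grams (adjacent word pairs)

print(word_freq.most_common(3))
print(bigrams.most_common(2))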


Textual analysis is a term used to study and understand texts. It includes exploring the language, symbols, patterns, and pictures in the text. Textual analysis helps us understand how people communicate their ideologies, thoughts, and experiences through texts. Most of the time, texts such as a video interview, a newspaper article, or a voice message tell a lot about how a certain topic is being influenced by a certain group of people.

Approaches to Textual Analysis
1) Rhetorical Criticism: A systematic study to describe, analyze, interpret, and evaluate the messages that are hidden in texts. The process of rhetorical criticism comprises five steps: understanding the purpose of the message; understanding the historical, cultural, and social context of the message; using it to evaluate society; building a theory and its applications; and teaching people how effective persuasion works.
2) Content Analysis: Used to identify the messages occurring in texts and enumerate them. It is generally not preferred by researchers, as they work on data that already exists rather than generating new data. Content analysis is made harder by the practice of coding and labelling the data, which in turn requires training the researchers to do so.
Qualitative content analysis focuses on the actual meaning embedded in the message rather than the number of times the message has occurred.
Quantitative content analysis is a step-by-step procedure in which the research questions are answered in a systematic manner.

3) Interaction Analysis: Includes studies ranging from linguistic features, such as words and sentence formation, to non-verbal factors such as hand gestures and eye contact, the topics that people discuss, and the purpose of specific actions and speeches. Interaction analysis is viewed as a much more complex process that requires extensive information on the topic and a strong ability to coordinate on the part of the researcher. The researcher has to arrange group meetings and observe the functional messages exchanged during the discussion and the conversants' actions.
4) Performance Studies: Focus on the richness of the text and its aesthetics. A researcher puts up a performance using the texts to act out how they affect a conversation. The steps are: Select: identify a text to examine. Play: try on different vocal and body language. Testing: conclude ways of understanding. Choosing: select a valid interpretation. Repeating: refine the chosen interpretation. Presenting: report what has been discovered.

Introduction
Textual analysis is a method of studying a text in order to understand its various meanings by identifying the who, what, when, where, why, and how of a text.

Text Analysis vs. Text Mining vs. Text Analytics

Types of text analysis techniques
1) Text classification: In text classification, the text analysis software learns how to associate certain keywords with specific topics, users' intentions, or sentiments. It does so by using the following methods:
Rule-based classification assigns tags to the text based on predefined rules for semantic components or syntactic patterns.
Machine learning-based systems work by training the text analysis software with examples and increasing its accuracy in tagging the text. They use models like Naive Bayes, Support Vector Machines, and deep learning to process the text, categorize words, and develop a semantic understanding of them.
For example, a favorable review often contains words like good, fast, and great, while negative reviews might contain words like unhappy, slow, and bad. Data scientists train the text analysis software to look for such specific terms and categorize the reviews as positive or negative. This way, the customer support team can easily monitor customer sentiments from the reviews.
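
As a rough sketch of the machine-learning-based approach described above, the following uses scikit-learn's Naive Bayes classifier on a tiny, made-up set of reviews (the data, labels, and variable names are illustrative assumptions, not the slide's own example):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "good and fast service, great product",
    "great experience, very good",
    "slow delivery and bad support, very unhappy",
    "bad product, unhappy customer",
]
labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()              # bag-of-words features
X = vectorizer.fit_transform(reviews)

clf = MultinomialNB()                       # Naive Bayes model
clf.fit(X, labels)

new_review = ["fast shipping and great quality"]
print(clf.predict(vectorizer.transform(new_review)))  # expected: ['positive']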

2) Text extraction: Text extraction scans the text and pulls out key information. It can identify keywords, product attributes, brand names, names of places, and more in a piece of text. The extraction software applies the following methods:
Regular expressions (REGEX): a formatted array of symbols that serves as a precondition of what needs to be extracted.
Conditional random fields (CRFs): a machine learning method that extracts text by evaluating specific patterns or phrases. It is more refined and flexible than REGEX.
For example, you can use text extraction to monitor brand mentions on social media. Manually tracking every occurrence of your brand on social media is impossible, but text extraction will alert you to mentions of your brand in real time (see the sketch after this list).
3) Topic modeling: Topic modeling methods identify and group related keywords that occur in an unstructured text into a topic or theme. These methods can read multiple text documents and sort them into themes based on the frequency of various words in each document. Topic modeling methods give context for further analysis of the documents. For example, you can use topic modeling to read through a scanned document archive and classify documents into invoices, legal documents, and customer agreements. You can then run different analysis methods on the invoices to gain financial insights, or on the customer agreements to gain customer insights.
4) PII redaction: PII redaction automatically detects and removes personally identifiable information (PII), such as names, addresses, or account numbers, from a document. PII redaction helps protect privacy and comply with local laws and regulations. For example, you can analyze support tickets and knowledge articles to detect and redact PII before you index the documents in the search solution, so that the indexed documents are free of PII.
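
As a minimal sketch of REGEX-based extraction, the following pulls brand mentions and order numbers out of free text; the patterns, handles, and data are hypothetical:

import re

posts = [
    "Loving my new @AcmePhone, order #A12345 arrived early!",
    "Support for @AcmePhone resolved ticket #B98765.",
]

brand_pattern = re.compile(r"@AcmePhone")   # brand mention
order_pattern = re.compile(r"#[A-Z]\d{5}")  # order/ticket number format

for post in posts:
    print(brand_pattern.findall(post), order_pattern.findall(post))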


Basic Concepts of Information Retrieval
Information retrieval (IR) is the study of helping users find information that matches their information needs. Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
Architecture of an Information Retrieval System

What is Information Retrieval (IR)?
"Information retrieval is a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information." (Salton, 1968)
The primary focus of IR since the 1950s has been on text and documents, and its primary goal is to "retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible."

The architecture of an IR system typically includes the following components:
User interface: enables user interaction with the system. It may include a search box, filters, and other tools that let users input their queries and refine the search results.
Query processing: transforms user queries into a form that can be used to search the index. Its sub-components may include query parsing, expansion, and rewriting.
Indexing: responsible for building an index of the document collection. This index normally lists all the terms that occur in the document collection, together with information about how frequently and where they appear.
Retrieval model: used to obtain and rank documents in response to a user query. Boolean, vector-space, and probabilistic models are a few of the various retrieval models available.
Ranking: determines the order in which documents are displayed to the user, depending on how relevant they are to the query. The ranking component generally uses the retrieval model to compute a relevance score for each document, and the documents are then sorted by that score (see the sketch after this list).
Document collection: the set of documents that must be searched. It may include a variety of sources, such as books, journals, and websites.
Storage: holds the document collection, the index, and other data that the system requires to execute searches and retrievals.
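
To make the retrieval model and ranking components concrete, here is a minimal vector-space sketch: documents and the query are converted to TF-IDF vectors and ranked by cosine similarity. It assumes scikit-learn and uses made-up documents:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Information retrieval systems index large text collections.",
    "Deep learning improves image recognition.",
    "Search engines rank documents for a user query.",
]
query = "ranking documents for a user query"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)   # index the collection
query_vector = vectorizer.transform([query])        # process the query

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(round(score, 3), doc)                     # highest score first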


A user with an information need issues a query to the retrieval system through the query operations module. The retrieval module uses the document index to retrieve those documents that contain some query terms (such documents are likely to be relevant to the query), computes relevance scores for them, and then ranks the retrieved documents according to those scores. The ranked documents are then presented to the user. The document collection is also called the text database; it is indexed by the indexer for efficient retrieval.

Query Types
A user query represents the user's information need and takes one of the following forms:
1) Keyword queries: The user expresses his/her information need with a list of (at least one) keywords (or terms), aiming to find documents that contain some (at least one) or all of the query terms.
Single-keyword query: e.g., "Machine Learning" retrieves documents containing the term "Machine Learning" anywhere.
Multi-word keyword query: e.g., "Artificial Intelligence applications" retrieves documents containing the terms "Artificial Intelligence" and "applications" in any order, possibly with other words in between.

2) Boolean queries: The user can use the Boolean operators AND, OR, and NOT to construct complex queries. Such queries consist of terms and Boolean operators.
Boolean AND operator: retrieves documents containing both "cat" and "dog". Query: cat AND dog
Boolean OR operator: retrieves documents containing either "cat" or "dog" or both. Query: cat OR dog
Boolean NOT operator: retrieves documents containing "cat" but not "dog". Query: cat NOT dog
How Boolean queries work in practice (assuming case-insensitive matching that also counts the plural forms "cats" and "dogs"):
Document 1: "The domestic cat is a popular pet."
Document 2: "Dogs are known for their loyalty."
Document 3: "My neighbor has both a cat and a dog."
Document 4: "Cats and dogs are common household pets."
Query: cat AND dog -- Result: Documents 3 and 4 match, since they are the only documents containing both "cat" and "dog".
Query: cat OR dog -- Result: Documents 1, 2, 3, and 4 match, since they all contain either "cat" or "dog".
Query: cat NOT dog -- Result: Document 1 matches, since it is the only document that contains "cat" but not "dog".
(A code sketch of these queries follows below.)
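
A minimal Python sketch of these three Boolean queries over the four example documents, under the same assumption (lowercasing plus stripping a trailing "s" so that "Cats"/"dogs" match "cat"/"dog"):

import re

docs = {
    1: "The domestic cat is a popular pet.",
    2: "Dogs are known for their loyalty.",
    3: "My neighbor has both a cat and a dog.",
    4: "Cats and dogs are common household pets.",
}

def terms(text):
    # lowercase, keep alphabetic tokens, strip a plural "s"
    return {w.rstrip("s") for w in re.findall(r"[a-z]+", text.lower())}

index = {doc_id: terms(text) for doc_id, text in docs.items()}

def AND(a, b): return {d for d, t in index.items() if a in t and b in t}
def OR(a, b):  return {d for d, t in index.items() if a in t or b in t}
def NOT(a, b): return {d for d, t in index.items() if a in t and b not in t}

print(sorted(AND("cat", "dog")))  # [3, 4]
print(sorted(OR("cat", "dog")))   # [1, 2, 3, 4]
print(sorted(NOT("cat", "dog")))  # [1]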

3) Phrase queries: Such a query consists of a sequence of words that makes up a phrase. Each returned document must contain at least one instance of the phrase. In a search engine, a phrase query is normally enclosed in double quotes, e.g., "Machine learning algorithms are used in various applications."
4) Proximity queries: A proximity query is a relaxed version of the phrase query and can be a combination of terms and phrases. Proximity queries seek the query terms within close proximity to each other, and the closeness is used as a factor in ranking the returned documents or pages.
5) Full document queries: When the query is a full document, the user wants to find other documents that are similar to the query document. Some search engines (e.g., Google) allow the user to issue such a query by providing the URL of a query page.
6) Natural language queries: This is the most complex case, and also the ideal case. The user expresses his/her information need as a natural language question, and the system then finds the answer.
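
A minimal sketch of phrase matching (the documents are hypothetical): a document matches when the quoted phrase appears as a contiguous, case-insensitive word sequence:

import re

docs = [
    "Machine learning algorithms are used in various applications.",
    "Learning about machines is fun.",
]
phrase = "machine learning"
pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)

for i, doc in enumerate(docs, 1):
    print(i, bool(pattern.search(doc)))  # 1 True, 2 False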

7) Query operations module: This module can range from very simple to very complex. In the simplest case, it does nothing but pass the query to the retrieval engine after some simple pre-processing, e.g., removal of stopwords (words that occur very frequently in text but carry little meaning, e.g., "the", "a", "in"). In more complex cases, it needs to transform natural language queries into executable queries. It may also accept user feedback and use it to expand and refine the original query; this is usually called relevance feedback.
8) Indexer: The indexer is the module that indexes the original raw documents in some data structures to enable efficient retrieval. The result is the document index.
9) Inverted index: The inverted index is another type of indexing scheme, used in search engines and most IR systems. An inverted index is easy to build and very efficient to search (see the sketch after this list).
10) Ranking: The retrieval system computes a relevance score for each indexed document with respect to the query. The documents are ranked according to their relevance scores and presented to the user. Note that the system usually does not compare the user query with every document in the collection, which would be too inefficient. Instead, a small subset of the documents that contain at least one query term is first found from the index, and relevance scores with respect to the user query are computed only for this subset.
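
A minimal inverted-index sketch under simplifying assumptions (whitespace tokenization, a crude count of matching query terms standing in for a real relevance score); the documents and names are illustrative:

from collections import defaultdict

docs = {
    1: "information retrieval systems index documents",
    2: "text mining extracts patterns from text",
    3: "search engines rank retrieved documents",
}

# term -> set of IDs of documents containing the term
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

def rank(query):
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in inverted_index.get(term, set()):
            scores[doc_id] += 1          # only docs sharing a query term get scored
    return sorted(scores.items(), key=lambda item: -item[1])

print(rank("rank documents"))            # [(3, 2), (1, 1)]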

Text Preprocessing
Text preprocessing refers to the process of converting human language text into a machine-interpretable form that can be used for further processing in a predictive modeling task. Common steps include lowercasing, removing punctuation, removing stopwords, stemming, lemmatization, removing emojis, and removing URLs.
1) Lowercasing
It is very easy to lowercase text, simply by using the built-in lower method:

def lowercase_text(text):
    return text.lower()

text = 'My name is BHAVANA JAMALPUR.'
print(lowercase_text(text))

2) Removing punctuation
All the punctuation characters are stored in PUNCT_TO_REMOVE. The translate method deletes every punctuation character, and the split/join step collapses any extra whitespace left behind:

import string

PUNCT_TO_REMOVE = string.punctuation

def remove_punctuation(text):
    return ' '.join(text.translate(str.maketrans('', '', PUNCT_TO_REMOVE)).split())

text = "I'm having a lot of punctuations!!. All special characters will be removed ;) :) Is it so ? ## Yes :( I will."
print(remove_punctuation(text))

3) Removing stopwords
A stopword is a commonly used word (such as 'the', 'a', 'an', 'in', ...) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query. Here we import the English stopword list from NLTK and then remove all the stopwords present in the text.
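
The slide describes this step without showing the code; a minimal sketch using NLTK's English stopword list (assumes the nltk package is installed; the corpus is downloaded on first use):

import nltk
nltk.download('stopwords', quiet=True)
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english'))

def remove_stopwords(text):
    return ' '.join(w for w in text.split() if w.lower() not in STOPWORDS)

text = 'A stopword is a commonly used word in the English language.'
print(remove_stopwords(text))  # stopword commonly used word English language.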

Search Engines
A search engine is a software system or online service that allows users to find information on the internet by entering keywords or phrases; it then retrieves and presents relevant web pages or documents related to the user's query. The information you are searching for is displayed as a list of results known as search engine results pages (SERPs).


Basically, all search engines go through three stages:

Crawling: This stage involves scanning sites and obtaining information about everything contained there: page title, keywords, layout, and the pages it links to, at a bare minimum. This task is performed by special software robots called "spiders" or "crawlers". A spider or web crawler explores the Internet, looking for material to add to the search engine's index. These web crawlers can examine a website's entire system and sub-domains, including video and graphic material (a toy sketch of this stage follows below).
Indexing: Web indexing refers to indexing the contents of a website or the entire Internet. Search engine indexing is the process of compiling, analysing, and storing data to streamline immediate and accurate extraction of information.
Ranking: When a search engine receives a query, an algorithm searches the index for relevant data and then arranges it in hierarchical order. Ranking refers to the order in which search engine results pages (SERPs) appear.
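
A toy illustration of the crawling stage, not a production crawler: it fetches a page, records its title, and queues the links it finds. It assumes the third-party packages requests and beautifulsoup4:

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=5):
    to_visit, seen, pages = [start_url], set(), {}
    while to_visit and len(pages) < max_pages:
        url = to_visit.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue                      # skip unreachable pages
        soup = BeautifulSoup(html, 'html.parser')
        pages[url] = soup.title.string if soup.title else url   # material for the index
        to_visit.extend(a['href'] for a in soup.find_all('a', href=True)
                        if a['href'].startswith('http'))
    return pages

print(crawl('https://example.com'))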

Google, Yahoo, Bing, Baidu, and DuckDuckGo are popular search engines. Google is one of the most used search engines worldwide and is commonly used with the Chrome browser.
Google: Google Search (commonly referred to as Google) was developed by founders Larry Page and Sergey Brin around the idea that pages linked to by many other websites are likely to be more important. Google employs advanced algorithms to provide users with the most accurate results. The order in which Google returns search results depends on a priority ranking system known as PageRank. Google has expanded its ranking algorithm with hundreds of other parameters over the years, such as the machine learning system RankBrain.
https://www.slideserve.com/gore/internet-browsing-surfing-powerpoint-ppt-presentation
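
To illustrate the idea behind PageRank, here is a minimal power-iteration sketch on a hypothetical three-page link graph; it is not Google's production algorithm, and the damping factor of 0.85 is the commonly cited textbook value:

links = {            # page -> pages it links to
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
}

damping, n = 0.85, len(links)
rank = {page: 1.0 / n for page in links}

for _ in range(50):  # power iteration
    new_rank = {}
    for page in links:
        incoming = sum(rank[p] / len(out) for p, out in links.items() if page in out)
        new_rank[page] = (1 - damping) / n + damping * incoming
    rank = new_rank

print(rank)          # 'C', linked from both 'A' and 'B', scores highest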

Thank You