Information Retrieval basic presentation

ThangamaniMani9 20 views 10 slides May 20, 2024

Slide Content

HINDUSTHAN INSTITUTE OF TECHNOLOGY
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
20CS512 – INFORMATION RETRIEVAL
Dr. M. Thangamani

Architecture of the IR

Components of the IR
Document Collection: The document collection is the set of documents that the information retrieval system (IRS) indexes and searches. Documents can be in various formats, including text, images, audio, video, or a combination of these.
Preprocessing: Preprocessing transforms the raw documents into a form that can be easily indexed and searched. This may include tokenization, stemming, stop-word removal, and other text processing techniques.
Indexing: Indexing is the process of creating a data structure (such as an inverted index) that allows efficient searching and retrieval of documents. The index maps terms or keywords to the documents in which they appear.
Query Processing: Query processing interprets the user's query and transforms it into a form that can be matched against the indexed documents. This may include parsing, tokenization, stemming, and other query processing techniques.
Search and Retrieval: Search and retrieval searches the index to find documents that match the query. This may involve ranking the documents by their relevance to the query and returning the most relevant documents to the user.
User Interface: The user interface is the front end of the IRS that allows users to enter queries, view search results, and interact with the system. It may include features such as faceted search, filtering, sorting, and visualization.
Relevance Feedback: Relevance feedback collects feedback from users about the relevance of the retrieved documents and uses it to improve the search results. This may involve adjusting the search algorithm, updating the index, or re-ranking the documents.
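The preprocessing, indexing, query processing, and retrieval components can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular system: the stop-word list, the function names (preprocess, build_inverted_index, search), and the choice to require every query term are assumptions made for the example. The three sample sentences are borrowed from the term-matrix exercise in this deck.

```python
from collections import defaultdict

# Hypothetical stop-word list for illustration only; real systems use larger lists.
STOP_WORDS = {"is", "a", "the", "to", "also", "i"}

def preprocess(text):
    """Tokenize on whitespace, strip punctuation, lowercase, drop stop words."""
    tokens = (word.strip(".,!?;:").lower() for word in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def build_inverted_index(collection):
    """Map each term to the set of document ids in which it appears."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (no ranking)."""
    terms = preprocess(query)  # the query goes through the same preprocessing
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

collection = {
    1: "Taj Mahal is a beautiful monument.",
    2: "Victoria Memorial is also a monument.",
    3: "I like to visit Agra.",
}
index = build_inverted_index(collection)
print(sorted(search(index, "a monument")))   # [1, 2]
print(sorted(search(index, "visit Agra")))   # [3]
```

Running the query through the same preprocess function as the documents is what makes "Agra" in a query match "agra" in the index. A query made up only of stop words returns no documents in this sketch.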

Process of querying, indexing and retrieval system

Information Retrieval Model
Basic information retrieval model.
Classical IR Model: Boolean Model, Vector Space Model, Probability Distribution Model

Continued
Boolean Model: This model is based on Boolean logic, where a query is formulated using logical operators such as AND, OR, and NOT. The model returns the documents that satisfy the Boolean expression specified in the query.
Vector Space Model (VSM): This model represents documents and queries as vectors in a high-dimensional space, where each dimension corresponds to a term. The similarity between a document and a query is measured using a similarity metric, such as cosine similarity.
Probabilistic Model: This model estimates the probability that a document is relevant to a query based on the probability of the query terms appearing in the document. One of the best-known probabilistic models is the Binary Independence Model (BIM).
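The Boolean and vector space models can be contrasted on a tiny example. The sketch below is illustrative only: the helper names (postings, cosine), the two sample queries, and the use of raw term counts as vector weights are assumptions, and the documents are lowercased versions of the sample sentences used in this deck's exercise.

```python
import math
from collections import Counter

docs = {
    "D1": "taj mahal is a beautiful monument",
    "D2": "victoria memorial is also a monument",
    "D3": "i like to visit agra",
}
tokens = {name: text.split() for name, text in docs.items()}

# Boolean model: each term maps to the set of documents containing it,
# and AND / OR / NOT become set intersection / union / difference.
def postings(term):
    return {name for name, words in tokens.items() if term in words}

all_docs = set(docs)
# Query: monument AND NOT beautiful
boolean_hits = postings("monument") & (all_docs - postings("beautiful"))
print(sorted(boolean_hits))  # ['D2']

# Vector space model: raw term-frequency vectors compared by cosine similarity.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # a Counter returns 0 for missing terms
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

query = Counter("beautiful monument".split())
scores = {name: cosine(query, Counter(words)) for name, words in tokens.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['D1', 'D2', 'D3']
```

The Boolean query returns an unordered set of exact matches, while the vector space query orders all documents by score: D1 shares both query terms, D2 shares one, and D3 shares none.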

Visualization Interface
Scatter Plots: Scatter plots display the relationship between two variables. They can show the distribution of search results along two dimensions, such as relevance and publication date.
Bar Charts: Bar charts display the frequency or count of items in different categories. They can show the distribution of search results across facets such as document type, author, or topic.
Line Charts: Line charts display trends over time. They can show the distribution of search results along a temporal dimension, such as publication date or time of access.
Heat Maps: Heat maps display data in a matrix format, where the values are represented by colors. They can show the distribution of search results along two dimensions, such as relevance and publication date.
Word Clouds: Word clouds display the frequency of terms in a text, with more frequent terms shown in larger font sizes. They can show the distribution of terms in the search results or highlight the most relevant terms in a document.
Network Graphs: Network graphs display relationships between entities, such as documents, authors, or topics. They can show the connections between documents based on citations, references, or other relationships.
Map Visualizations: Map visualizations display geographical data. They can show the distribution of search results along geographical dimensions, such as location or region.
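As a rough illustration of the bar-chart case, the sketch below counts search results per facet value and renders the counts as a text bar chart. The result records, field names, and helper functions (facet_counts, text_bar_chart) are hypothetical and exist only for this example. A real interface would typically hand the same counts to a plotting library.

```python
from collections import Counter

# Hypothetical search results; the fields and values are made up for illustration.
results = [
    {"title": "Intro to IR", "type": "book", "year": 2008},
    {"title": "Ranking basics", "type": "article", "year": 2015},
    {"title": "Index compression", "type": "article", "year": 2015},
    {"title": "Query logs", "type": "report", "year": 2020},
]

def facet_counts(items, field):
    """Count how many results fall into each value of one facet."""
    return Counter(item[field] for item in items)

def text_bar_chart(counts):
    """Render counts as rows of '#' characters, largest category first."""
    width = max(len(str(label)) for label in counts)
    return [f"{str(label):<{width}} {'#' * n} {n}" for label, n in counts.most_common()]

type_counts = facet_counts(results, "type")
for row in text_bar_chart(type_counts):
    print(row)
```

Passing "year" instead of "type" to facet_counts gives the counts a line chart over publication date would plot.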

Continued
Construct the term matrix for the following documents and query.
Documents:
1. Taj Mahal is a beautiful monument.
2. Victoria Memorial is also a monument.
3. I like to visit Agra.

Term Document Matrix

Term        D1  D2  D3
Taj          1   0   0
Mahal        1   0   0
is           1   1   0
a            1   1   0
beautiful    1   0   0
monument     1   1   0
Victoria     0   1   0
Memorial     0   1   0
also         0   1   0
I            0   0   1
like         0   0   1
to           0   0   1
visit        0   0   1
Agra         0   0   1

In this matrix, the rows represent the terms and the columns represent the documents (D1, D2, and D3). Each entry is the frequency of the term in that document. For example, the term "Taj" appears once in document D1 and not in documents D2 and D3, so its row is (1, 0, 0).
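The matrix can also be built programmatically. The following is a minimal sketch under simple assumptions: tokens are split on whitespace, trailing punctuation is stripped, case is kept as written, and the names tokenize, vocabulary, and matrix are chosen for this example. Under this tokenization the article "a" occurs in D1 and D2 only, because "Agra" is a separate term.

```python
docs = {
    "D1": "Taj Mahal is a beautiful monument.",
    "D2": "Victoria Memorial is also a monument.",
    "D3": "I like to visit Agra.",
}

def tokenize(text):
    """Split on whitespace and strip surrounding punctuation; case is preserved."""
    return [w.strip(".,") for w in text.split() if w.strip(".,")]

tokenized = {name: tokenize(text) for name, text in docs.items()}

# Vocabulary in order of first appearance across D1, D2, D3.
vocabulary = []
for words in tokenized.values():
    for w in words:
        if w not in vocabulary:
            vocabulary.append(w)

# One row per term, one column per document; entries are raw term counts.
matrix = {term: [tokenized[d].count(term) for d in docs] for term in vocabulary}

print(f"{'Term':<10}" + "".join(f"{d:>4}" for d in docs))
for term, row in matrix.items():
    print(f"{term:<10}" + "".join(f"{n:>4}" for n in row))
```

Each row of the printed table is the term's vector across the documents, and each column is the document's term-frequency vector, which is the representation the vector space model compares against a query.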