Unit 1 Information Storage and Retrieval

874 views 25 slides Jul 05, 2024
Slide 1
Slide 1 of 25
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17
Slide 18
18
Slide 19
19
Slide 20
20
Slide 21
21
Slide 22
22
Slide 23
23
Slide 24
24
Slide 25
25

About This Presentation

Introduction to information retrieval, Major challenges in IR


Slide Content

COURSE CONTENTS
Unit I Introduction to Information Retrieval ( 06 hrs )
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system
block diagram. Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighing,
Probabilistic Indexing Clustering Techniques : Single pass algorithm , Single Link algorithm
Text & Reference Book
Yates & Neto, Modern Information Retrieval, Pearson
Education, ISBN:81-297-0274-6
C.J. Rijsbergen, Information Retrieval,
(www.dcs.gla.ac.uk).,2ndISBN:978- 408709293
CO 1:
UnderstandtheconceptofInformationretrievalandapply
clusteringininformationretrieval.
Prepared By : Prof. Datta S. Shingate

•Retrieval - “Fetch something”
•Data - raw alphanumeric values.
•Information – Processed data.
•Knowledge – What we know.
•Types of Information
•Text
•Images
•Audio
•Video
•Source Code
•Applications/Web services
•XML and structured documents
Definition of IR

Defining Data, Information, Knowledge & Wisdom

Definition of IR
•Goal
Find the documents most relevant to user Query.
•Information Retrieval (IR)
Informationretrieval(IR)maybedefinedasasoftwareprogramthat
dealswiththeorganization,storage,retrievalandevaluationof
informationfromdocumentrepositoriesparticularlytextualinformation.

Data Retrieval Vs Information Retrieval
Data Retrieval Information Retrieval
•Retrieves data based on the keywords in the query
entered by the user.
•Retrieves information based on the similarity
between the query and the document.
•There is no room for errors since it results in
complete system failure.
•Small errors are tolerated and will likely go
unnoticed.
•It has a defined structure with respect to
semantics.
•It is ambiguous and doesn’t have a defined
structure.
•Provides solutions to the user of the database
system.
•Does not provide a solution to the user of the
database system.
•Data Retrieval system produces exact results.•Information Retrieval system produces
approximate results
•Displayed results are not sorted by relevance. •Displayed results are sorted by relevance
•Eg : SQL •Eg : Google Search Engine

Text mining and Information Retrieval (IR)
Text mining is a process of extracting useful information and patterns from
a large volume of text databases.

IR SystemBlock Diagram
Fig : Typical IR System (Black Box) Fig : Information Retrieval (IR) Process

Evaluation Criteria
•Recall – is defined as the portion of the total relevant document that is
retrieved.
Recall =
No of Relevant document retrieved
* 100
??????��??????� �� �� �����??????�� ��������� ??????� �ℎ� �������??????��
•Precision - is defined as the portion of the document retrieved that is
relevant.
Precision =
No of Relevant document retrieved
* 100
??????��??????� �� �� ��������� ����??????����

Automatic Text Analysis
1.Document Representative
2.Text Summarization
3.Luhn’s Idea
Document
Document
Representative
Predictions from
Frequency of
Words
Conflation
Algorithm

Luhn’s Idea
Stop
words
TheLuhn’sIdeaSays:
->Toolowfrequentwordsarenotsignificant.
->Toohighfrequentwordsarealsonotsignificant
(e.g.“is”,“and”).
->Removinglowfrequentwordsiseasy.
--Setaminimumfrequency-threshold
->Removingcommon(highfrequent)words:
--Settingamaximumfrequencythreshold
(statisticallyobtained)
--Comparingtoacommon-wordlist
->Usedforsummarizingtechnicaldocuments.

Conflation Algorithm
1.Open and read each input file and create a single index file.
2.Remove high frequency words (stop words) .
3.Remove all suffixes/affixes from each word if present.
4.Detecting equivalent stems.
5.Store in index file.
{Compute, Computer, Computing} → Comput
{Walks, Walking, Walker} → Walk
{develop, developing, development, developments } → develop

High frequency words

Indexing Subsystem

Clustering in Information Retrieval
Medical
Legal
Financial
Documents Collection

Clustering in Information Retrieval
Similarity matrix
Objects: {1,2,3,4,5,6}
Threshold: .89
Graph Theoretic Approach
C1 :{1,4,5,6}
C2 :{2}
C3 : {3}

Similarity Measures

Jaccard’s Similarity Example

Single Pass Clustering Algorithm
1.Assign the first document D1 as the representative for C1.
2.For Di, calculate the similarity S with the representative for each existing cluster.
3.If Smax is greater than a threshold value ST, add the item to the corresponding cluster
and recalculate the cluster representative; otherwise, use Di to initiate a new cluster.
4.If an item Di remains to be clustered, return to step 2.

Example of Single Pass Clustering Technique
Suppose that we have the following set of documents and terms, and
that we are interested in clustering the terms using the single pass
method. Threshold value is 10

Example of Single Pass Clustering Technique

Example of Single Pass Clustering Technique

Example of Single Pass Clustering Technique

Example of Single Pass Clustering Technique

Single Link Clustering Algorithm
Dissimilarity
matrix:

Thanks
Tags