Unit 1 Information Storage and Retrieval

COURSE CONTENTS
Unit I Introduction to Information Retrieval ( 06 hrs )
Basic Concepts of IR, Data Retrieval & Information Retrieval, text mining and IR relation, IR system
block diagram. Automatic Text Analysis: Luhn's ideas, Conflation Algorithm, Indexing and Index Term Weighing,
Probabilistic Indexing Clustering Techniques : Single pass algorithm , Single Link algorithm
Text & Reference Book
Yates & Neto, Modern Information Retrieval, Pearson
Education, ISBN:81-297-0274-6
C.J. Rijsbergen, Information Retrieval,
(www.dcs.gla.ac.uk).,2ndISBN:978- 408709293
CO 1:
UnderstandtheconceptofInformationretrievalandapply
clusteringininformationretrieval.
Prepared By : Prof. Datta S. Shingate

•Retrieval - “Fetch something”
•Data - raw alphanumeric values.
•Information – Processed data.
•Knowledge – What we know.
•Types of Information
•Text
•Images
•Audio
•Video
•Source Code
•Applications/Web services
•XML and structured documents
Definition of IR

Defining Data, Information, Knowledge & Wisdom

Definition of IR
•Goal
Find the documents most relevant to user Query.
•Information Retrieval (IR)
Informationretrieval(IR)maybedefinedasasoftwareprogramthat
dealswiththeorganization,storage,retrievalandevaluationof
informationfromdocumentrepositoriesparticularlytextualinformation.

Data Retrieval Vs Information Retrieval
Data Retrieval Information Retrieval
•Retrieves data based on the keywords in the query
entered by the user.
•Retrieves information based on the similarity
between the query and the document.
•There is no room for errors since it results in
complete system failure.
•Small errors are tolerated and will likely go
unnoticed.
•It has a defined structure with respect to
semantics.
•It is ambiguous and doesn’t have a defined
structure.
•Provides solutions to the user of the database
system.
•Does not provide a solution to the user of the
database system.
•Data Retrieval system produces exact results.•Information Retrieval system produces
approximate results
•Displayed results are not sorted by relevance. •Displayed results are sorted by relevance
•Eg : SQL •Eg : Google Search Engine

Text mining and Information Retrieval (IR)
Text mining is a process of extracting useful information and patterns from
a large volume of text databases.

IR SystemBlock Diagram
Fig : Typical IR System (Black Box) Fig : Information Retrieval (IR) Process

Evaluation Criteria
•Recall – is defined as the portion of the total relevant document that is
retrieved.
Recall =
No of Relevant document retrieved
* 100
??????&#3627408476;&#3627408481;??????&#3627408473; &#3627408475;&#3627408476; &#3627408476;&#3627408467; &#3627408479;&#3627408466;&#3627408473;&#3627408466;&#3627408483;??????&#3627408475;&#3627408481; &#3627408465;&#3627408476;&#3627408464;&#3627408482;&#3627408474;&#3627408466;&#3627408475;&#3627408481;&#3627408480; ??????&#3627408475; &#3627408481;ℎ&#3627408466; &#3627408464;&#3627408476;&#3627408473;&#3627408473;&#3627408466;&#3627408464;&#3627408481;??????&#3627408476;&#3627408475;
•Precision - is defined as the portion of the document retrieved that is
relevant.
Precision =
No of Relevant document retrieved
* 100
??????&#3627408476;&#3627408481;??????&#3627408473; &#3627408475;&#3627408476; &#3627408476;&#3627408467; &#3627408465;&#3627408476;&#3627408464;&#3627408482;&#3627408474;&#3627408466;&#3627408475;&#3627408481;&#3627408480; &#3627408479;&#3627408466;&#3627408481;&#3627408479;??????&#3627408466;&#3627408483;&#3627408466;&#3627408465;

Automatic Text Analysis
1.Document Representative
2.Text Summarization
3.Luhn’s Idea
Document
Document
Representative
Predictions from
Frequency of
Words
Conflation
Algorithm

Luhn’s Idea
Stop
words
TheLuhn’sIdeaSays:
->Toolowfrequentwordsarenotsignificant.
->Toohighfrequentwordsarealsonotsignificant
(e.g.“is”,“and”).
->Removinglowfrequentwordsiseasy.
--Setaminimumfrequency-threshold
->Removingcommon(highfrequent)words:
--Settingamaximumfrequencythreshold
(statisticallyobtained)
--Comparingtoacommon-wordlist
->Usedforsummarizingtechnicaldocuments.

Conflation Algorithm
1.Open and read each input file and create a single index file.
2.Remove high frequency words (stop words) .
3.Remove all suffixes/affixes from each word if present.
4.Detecting equivalent stems.
5.Store in index file.
{Compute, Computer, Computing} → Comput
{Walks, Walking, Walker} → Walk
{develop, developing, development, developments } → develop

High frequency words

Indexing Subsystem

Clustering in Information Retrieval
Medical
Legal
Financial
Documents Collection

Clustering in Information Retrieval
Similarity matrix
Objects: {1,2,3,4,5,6}
Threshold: .89
Graph Theoretic Approach
C1 :{1,4,5,6}
C2 :{2}
C3 : {3}

Similarity Measures

Jaccard’s Similarity Example

Single Pass Clustering Algorithm
1.Assign the first document D1 as the representative for C1.
2.For Di, calculate the similarity S with the representative for each existing cluster.
3.If Smax is greater than a threshold value ST, add the item to the corresponding cluster
and recalculate the cluster representative; otherwise, use Di to initiate a new cluster.
4.If an item Di remains to be clustered, return to step 2.

Example of Single Pass Clustering Technique
Suppose that we have the following set of documents and terms, and
that we are interested in clustering the terms using the single pass
method. Threshold value is 10

Example of Single Pass Clustering Technique

Single Link Clustering Algorithm
Dissimilarity
matrix:

Thanks

Unit 1 Information Storage and Retrieval

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Unit 1 Information Storage and Retrieval

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Pray For The Peace Of Jerusalem and You Will Prosper

Don_t_Waste_Your_Life_God.....powerpoint

VILLASUR_FACTORS_TO_CONSIDER_IN_PLATING_SALAD_10-13.pdf

Fertility awareness methods for women in the society

Chapter 5 Arithmetic Functions Computer Organisation and Architecture

syakira bhasa inggris (1) (1).pptx.......