Information Retrieval Models

4,340 views 24 slides Mar 05, 2020

About This Presentation

Describes various information retrieval models


Slide Content

Chapter 2 Modeling
Computer Science and Information Engineering, class 4B, 86075800
陳建勳

Introduction
Traditional information retrieval systems
usually adopt index terms to index and retrieve
documents.
An index term is a keyword (or group of related
words) which has some meaning of its own
(usually a noun).

The advantages of using index
terms
Simple
The semantics of the documents and of the
user information need can be naturally
expressed through sets of index terms.
Ranking algorithms are at the core of information
retrieval systems (predicting which documents are
relevant and which are not).

A taxonomy of information retrieval
models
User task
  Retrieval: Ad hoc, Filtering
  Browsing
Retrieval models
  Classic Models: Boolean, Vector, Probabilistic
    Set Theoretic: Fuzzy, Extended Boolean
    Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
    Probabilistic: Inference Network, Belief Network
  Structured Models: Non-overlapping Lists, Proximal Nodes
Browsing models
  Flat, Structure Guided, Hypertext

Logical view of documents: Index Terms, Full Text, Full Text + Structure
User task: Retrieval
  Index Terms: Classic, Set Theoretic, Algebraic, Probabilistic
  Full Text: Classic, Set Theoretic, Algebraic, Probabilistic
  Full Text + Structure: Structured
User task: Browsing
  Index Terms: Flat
  Full Text: Flat, Hypertext
  Full Text + Structure: Structure Guided, Hypertext
Figure 2.2 Retrieval models most frequently associated with distinct
combinations of a document logical view and a user task.

Retrieval: Ad hoc and Filtering
Ad hoc: The documents in the collection
remain relatively static while new queries
are submitted to the system.
Filtering: The queries remain relatively
static while new documents come into the
system.

Filtering
Typically, the filtering task simply
indicates to the user the documents
which might be of interest to them.
Routing: Rank the filtered documents
and show this ranking to the user.
User profiles can be constructed in two ways.
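
The difference between filtering and routing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how a user profile is represented, so the profile here is assumed to be a plain set of keywords, and the overlap-count score and the function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Document = Tuple[str, Set[str]]  # (document id, set of index terms)


def score(profile: Set[str], doc_terms: Set[str]) -> int:
    """Assumed score: number of profile keywords found in the document."""
    return len(profile & doc_terms)


def filter_documents(profile: Set[str], incoming: Iterable[Document]) -> List[str]:
    """Filtering: report which incoming documents might interest the user."""
    return [doc_id for doc_id, terms in incoming if score(profile, terms) > 0]


def route_documents(profile: Set[str], incoming: Iterable[Document]) -> List[str]:
    """Routing: same selection, but shown to the user as a ranking."""
    scored = [(score(profile, terms), doc_id) for doc_id, terms in incoming]
    kept = [pair for pair in scored if pair[0] > 0]
    # sorted() is stable, so documents with equal scores keep arrival order.
    return [doc_id for _, doc_id in sorted(kept, key=lambda pair: -pair[0])]
```

With a profile such as {"retrieval", "ranking"}, filtering returns the matching documents in arrival order, while routing puts the document that shares more keywords first.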

A formal characterization of IR models
D: A set composed of logical views (or
representations) of the documents in the
collection.
Q: A set composed of logical views (or
representations) of the user information
needs (queries).
F: A framework for modeling document
representations, queries, and their relationships.
R(q_i, d_j): A ranking function which defines an
ordering among the documents with regard to the
query q_i.

Classic information retrieval
model
Basic concepts: Each document is
described by a set of representative
keywords called index terms.
Numerical weights are assigned to index
terms to capture their differing relevance
in describing the document contents.

Define
k_i: A generic index term
K: The set of all index terms {k_1, …, k_t}
w_i,j: A weight associated with index term
k_i of a document d_j
g_i: A function that returns the weight associated
with k_i in any t-dimensional vector (g_i(d_j) = w_i,j)

Boolean model
Based on a binary decision criterion without any
notion of a grading scale.
Boolean expressions have precise semantics, but it is
not simple to translate an information need into
a Boolean expression.
A query can be represented as a disjunction of
conjunctive vectors (disjunctive normal form,
DNF).
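
As a minimal sketch of the binary decision criterion, the Python below matches a document against a query given in DNF. The query q = ka AND (kb OR NOT kc), the term names, and the function names are illustrative assumptions, not taken from the slides. Over the terms (ka, kb, kc), that query has the conjunctive components (1,1,1), (1,1,0), and (1,0,0).

```python
from typing import Iterable, Set, Tuple

# Assumed example query: q = ka AND (kb OR NOT kc), written in DNF as
# binary vectors over the query terms (ka, kb, kc).
QUERY_TERMS: Tuple[str, ...] = ("ka", "kb", "kc")
QUERY_DNF: Set[Tuple[int, ...]] = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}


def binary_vector(doc_terms: Iterable[str],
                  query_terms: Tuple[str, ...] = QUERY_TERMS) -> Tuple[int, ...]:
    """Project a document onto the query terms with binary weights."""
    present = set(doc_terms)
    return tuple(1 if term in present else 0 for term in query_terms)


def boolean_sim(doc_terms: Iterable[str],
                query_dnf: Set[Tuple[int, ...]] = QUERY_DNF,
                query_terms: Tuple[str, ...] = QUERY_TERMS) -> int:
    """Return 1 if the document matches some conjunctive component, else 0."""
    return 1 if binary_vector(doc_terms, query_terms) in query_dnf else 0
```

A document containing ka and kb is judged relevant (1), while one containing only ka and kc is not (0). There is no grading in between, which is the limitation the slide points out.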

Vector model
Assign non-binary weights to index
terms in queries and in documents.
Compute the similarity between
documents and query.
Gives more precise answers than the Boolean model.
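
The slides do not name a specific similarity measure. One common choice is the cosine of the angle between the document and query weight vectors, sketched below in Python. The function names and the toy vectors in the usage note are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(doc: Sequence[float], query: Sequence[float]) -> float:
    """Cosine of the angle between two weight vectors of equal length."""
    dot = sum(d * q for d, q in zip(doc, query))
    doc_norm = math.sqrt(sum(d * d for d in doc))
    query_norm = math.sqrt(sum(q * q for q in query))
    if doc_norm == 0 or query_norm == 0:
        return 0.0  # an all-zero vector shares nothing with the other
    return dot / (doc_norm * query_norm)


def rank(docs: Dict[str, Sequence[float]], query: Sequence[float]) -> List[str]:
    """Order document ids by decreasing similarity to the query."""
    return sorted(docs, key=lambda doc_id: cosine_similarity(docs[doc_id], query),
                  reverse=True)
```

For example, with documents d1 = [1, 0, 0], d2 = [0.5, 0.5, 0], d3 = [0, 0, 1] and query [1, 0, 0], the ranking is d1, d2, d3. Unlike the Boolean model, d2 is retrieved as a partial match rather than rejected.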

Idea
We think of the documents as a collection C
of objects and think of the user query as a
specification of a set A of objects. In this
scenario, the IR problem can be reduced to
the problem of determining which documents
are in the set A and which ones are not (i.e.,
the IR problem can be viewed as a
clustering problem).

Intra-cluster: One needs to determine
which features best
describe the objects in the set A.
Inter-cluster: One needs to determine
which features best
distinguish the objects in the set A from
the remaining objects in the collection C.

tf: intra-cluster similarity is quantified by
measuring the raw frequency of a term k_i
inside a document d_j. This term frequency is
usually referred to as the tf factor and
provides one measure of how well that term
describes the document contents.
idf: inter-cluster dissimilarity is quantified by
measuring the inverse of the frequency of a
term k_i among the documents in the
collection. This factor is usually referred to
as the inverse document frequency, or idf factor.
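
One way to combine the two factors into a weight w_i,j is sketched below in Python. The slides give no formula, so the specific form used here, raw term count times log(N / n_i), is an assumption (a widely used variant), and the function name is hypothetical. N is the number of documents and n_i is the number of documents containing k_i.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Compute w_i,j = tf_i,j * log(N / n_i) for every term of every document.

    Assumes tf is the raw count of the term in the document; other
    normalizations are possible.
    """
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

A term that occurs in every document gets weight 0, because it does nothing to distinguish one document from the rest. A term that is frequent in one document but rare in the collection gets a high weight.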

The vector model is simple and fast. It is a
popular retrieval model.
Disadvantage: Index terms are
assumed to be mutually independent, so the
model does not account for index term
dependencies.

Probabilistic model
We can think of the querying process
as a process of specifying the properties
of an ideal answer set. The problem is
that we do not know exactly what these
properties are.

Structured text retrieval model
Retrieval models which combine information on
text content with information on the document
structure are called structured text retrieval
models.
Match point: refers to the position in the text
of a sequence of words which matches the user
query.
Region: refers to a contiguous portion of the
text.
Node: refers to a structural component of the
document such as a chapter, a section, or a
subsection.

Model based on Non-overlapping
lists
Divide the whole text of each document
into non-overlapping text regions which
are collected in a list.
Text regions in the same list do not
overlap, but text regions from
distinct lists might overlap.
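
A minimal Python sketch of this idea follows, assuming each text region is stored as a half-open (start, end) pair of character positions. That representation and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # half-open span [start, end) of text positions


def is_non_overlapping(regions: List[Region]) -> bool:
    """True if no two regions in the list share a text position."""
    ordered = sorted(regions)
    return all(prev_end <= next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))


def regions_containing(regions: List[Region], position: int) -> List[Region]:
    """Regions of one list that contain a given match point."""
    return [(start, end) for start, end in regions if start <= position < end]
```

For instance, a list of chapters [(0, 100), (100, 250)] and a list of sections [(0, 40), (40, 100), (100, 180), (180, 250)] are each non-overlapping on their own, but mixing the two lists produces overlaps, since every section lies inside a chapter.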

Model based on Proximal
nodes
A model which allows the definition of
independent hierarchical indexing
structures over the same document text.
Each of these index structures is a strict
hierarchy composed of chapters,
sections, paragraphs, pages, and lines,
which are called nodes.
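
One such strict hierarchy can be sketched as a tree of nodes, each covering a span of the text. The Python below is an assumption-laden illustration: the `Node` class, its fields, and `nodes_containing` are hypothetical names, and spans are taken to be half-open character ranges.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str            # e.g. "chapter", "section", "paragraph"
    start: int           # first text position covered by the node
    end: int             # one past the last covered position
    children: List["Node"] = field(default_factory=list)


def nodes_containing(root: Node, position: int) -> List[Node]:
    """Path of nodes, outermost first, whose span contains a match point."""
    if not (root.start <= position < root.end):
        return []
    path = [root]
    for child in root.children:
        sub_path = nodes_containing(child, position)
        if sub_path:
            # In a strict hierarchy siblings do not overlap, so at most
            # one child can contain the position.
            path.extend(sub_path)
            break
    return path
```

Given a chapter spanning 0 to 100 that holds a section spanning 0 to 60 with paragraphs at 0 to 30 and 30 to 60, a match point at position 45 resolves to the path chapter, section, paragraph. A second, independent hierarchy (pages and lines, for example) would be a separate tree over the same text.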

Models for browsing
Flat browsing
Structure guided browsing
The hypertext model

Flat browsing
The documents might be represented
as dots in a plane or as elements in a list.
Relevance feedback
Disadvantage: In a given page or
screen there may not be any indication
of the context the user is in.

Structure guided browsing
Documents are organized in a directory
structure, which groups documents covering
related topics.
The same idea can be applied to a
single document.
A history map can be used.

The hypertext model
Written text is usually conceived to be
read sequentially.
The reader should not expect to fully
understand the message conveyed by
the writer by randomly reading pieces
of text here and there.