Model of information retrieval (3)

9866825059 11,268 views 20 slides Sep 29, 2013

MODEL OF INFORMATION RETRIEVAL
By N. Sumanjali, Dept. of LIS, Pondicherry University

INFORMATION RETRIEVAL
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.
Goal: find the documents most relevant to a certain query.
Dealing with notions of:
- Collection of documents
- Query (user’s information need)
- Notion of relevance

MODEL
- A model is a construct designed to help us understand a complex system
  - A particular way of “looking at things”
- Models inevitably make simplifying assumptions
  - What are the limitations of the model?
- Different types of models:
  - Conceptual models
  - Physical analog models
  - Mathematical models

Retrieval Models
- A retrieval model specifies the details of:
  - Document representation
  - Query representation
  - Retrieval function
- Determines a notion of relevance.
- The notion of relevance can be binary or continuous (i.e. ranked retrieval).
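The three components above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names (`represent`, `binary_match`, `overlap_score`, `retrieve`) and the choice of word sets as the representation are assumptions made here, not something the slides specify. It shows the same documents scored under a binary and under a continuous notion of relevance.

```python
from typing import Callable, Dict, List, Set, Tuple


def represent(text: str) -> Set[str]:
    """Assumed representation for documents and queries: a set of lowercased words."""
    return set(text.lower().split())


def binary_match(doc: Set[str], query: Set[str]) -> float:
    """Binary relevance: 1.0 if every query term occurs in the document, else 0.0."""
    return 1.0 if query <= doc else 0.0


def overlap_score(doc: Set[str], query: Set[str]) -> float:
    """Continuous relevance: fraction of query terms found in the document."""
    return len(doc & query) / len(query) if query else 0.0


def retrieve(
    docs: Dict[str, str],
    query: str,
    score: Callable[[Set[str], Set[str]], float],
) -> List[Tuple[str, float]]:
    """Retrieval function: score every document and return the non-zero ones, best first."""
    q = represent(query)
    scored = [(name, score(represent(text), q)) for name, text in docs.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: (-p[1], p[0]))


docs = {"d1": "boolean retrieval model", "d2": "vector space model"}
print(retrieve(docs, "boolean model", binary_match))   # only d1 qualifies
print(retrieve(docs, "boolean model", overlap_score))  # d1 first, then d2 as a partial match
```

Swapping the scoring function changes the notion of relevance while the representation stays the same.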

CLASSES OF RETRIEVAL MODELS
- Boolean models (set theoretic)
  - Extended Boolean
- Vector space models (statistical/algebraic)
  - Generalized VS
  - Latent Semantic Indexing
- Probabilistic models

MODELS OF IR
Boolean model
- Based on the notion of sets
- Documents are retrieved only if they satisfy the Boolean conditions specified in the query
- Does not impose a ranking on retrieved documents
- Exact match
Vector space model
- Based on geometry, the notion of vectors in high-dimensional space
- Documents are ranked based on their similarity to the query (ranked retrieval)
- Best/partial match
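To make the vector space idea concrete, here is a minimal sketch that ranks documents by cosine similarity to the query. Raw term counts are assumed as vector components, and the helper names (`to_vector`, `cosine`, `rank`) are hypothetical; real systems typically use weighted terms, which the slides do not cover.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent text as a sparse vector of raw term counts (an assumption of this sketch)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(docs, query):
    """Return (document, similarity) pairs, most similar first."""
    q = to_vector(query)
    scores = {name: cosine(to_vector(text), q) for name, text in docs.items()}
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))


docs = {"d1": "hotel rio brazil", "d2": "hotel hilo hawaii", "d3": "party"}
for name, score in rank(docs, "rio hotel"):
    print(name, round(score, 3))
```

Unlike the Boolean model, `d2` still receives a non-zero score for matching only one of the two query terms, which is the best/partial match behaviour described above.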

Language models
- Based on the notion of probabilities and processes for generating text
- Documents are ranked based on the probability that they generated the query
- Best/partial match
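A rough sketch of ranking by the probability that a document generated the query follows. It assumes a unigram model with add-one (Laplace) smoothing, which is a choice made here for simplicity and not something the slides prescribe; the function names are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(doc_text, query, vocab_size):
    """Log probability of the query under the document's unigram model.

    Add-one smoothing (an assumption of this sketch) keeps unseen terms
    from driving the probability to zero.
    """
    counts = Counter(doc_text.lower().split())
    total = sum(counts.values())
    log_p = 0.0
    for term in query.lower().split():
        log_p += math.log((counts[term] + 1) / (total + vocab_size))
    return log_p


def rank_by_likelihood(docs, query):
    """Return (document, log-likelihood) pairs, most likely generator first."""
    vocab = {w for text in docs.values() for w in text.lower().split()}
    vocab |= set(query.lower().split())
    scores = {
        name: query_log_likelihood(text, query, len(vocab))
        for name, text in docs.items()
    }
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))


docs = {"d1": "good party good music", "d2": "party to a lawsuit"}
print(rank_by_likelihood(docs, "good party"))
```

Here `d1` ranks above `d2` because it assigns a higher probability to both query terms, yet `d2` is still scored rather than rejected.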

BOOLEAN MODEL
- Based on the logic of George Boole (1815-1864).
- He devised a system of symbolic logic in which he used three operators (+, ×, -) to combine statements in symbolic form.
- John Venn later represented these logical relationships as diagrams.
- The operators of Boolean logic are the logical sum (+), logical product (×), and logical difference (-).
- IR systems allow users to express their queries using these operators.

BOOLEAN MODEL
- Each index term is either present or absent
- Documents are either relevant or not relevant (no ranking)
- A document is represented as a set of keywords.
- Queries are Boolean expressions of keywords, connected by AND, OR, and NOT, with brackets to indicate scope:
  [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
- Output: a document is relevant or not. No partial matches or ranking.
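The example query can be evaluated directly against keyword sets. The four documents below are invented for illustration, and the keywords are lowercased as an assumption of this sketch.

```python
# Hypothetical collection: each document is represented as a set of keywords.
docs = {
    "d1": {"rio", "brazil", "hotel"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"rio", "brazil", "beach"},
}


def matches(terms):
    """((Rio AND Brazil) OR (Hilo AND Hawaii)) AND hotel AND NOT Hilton"""
    place = ("rio" in terms and "brazil" in terms) or (
        "hilo" in terms and "hawaii" in terms
    )
    return place and "hotel" in terms and "hilton" not in terms


# Every document either satisfies the query or does not; there is no score.
relevant = sorted(name for name, terms in docs.items() if matches(terms))
print(relevant)  # ['d1', 'd3']
```

Note that `d4` is rejected outright even though it matches the place condition, since nothing in the model allows a partial match.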

BOOLEAN RETRIEVAL MODEL
Popular retrieval model because:
- Easy to understand for simple queries.
- Clean formalism.
- Boolean models can be extended to include ranking.
- Reasonably efficient implementations possible for normal queries.
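The slides do not say how ranking would be added. One simple possibility, assumed here purely for illustration, is to order the documents that satisfy an OR query by how many of the query terms they contain. The function name `ranked_or` is hypothetical.

```python
def ranked_or(docs, query_terms):
    """Return documents matching at least one query term, most matching terms first.

    This is one simple ranking extension chosen for this sketch; it is not
    the extended Boolean model from the literature.
    """
    results = []
    for name, terms in docs.items():
        hits = len(terms & query_terms)
        if hits > 0:
            results.append((name, hits))
    return sorted(results, key=lambda p: (-p[1], p[0]))


docs = {
    "d1": {"good", "party"},
    "d2": {"excellent", "party", "good"},
    "d3": {"lawsuit"},
}
print(ranked_or(docs, {"good", "excellent", "party"}))  # [('d2', 3), ('d1', 2)]
```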

BOOLEAN MODEL
- Weights assigned to terms are either “0” or “1”
  - “0” represents “absence”: term isn’t in the document
  - “1” represents “presence”: term is in the document
- Build queries by combining terms with Boolean operators
  - AND, OR, NOT
- The system returns all documents that satisfy the query
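A small sketch of the 0/1 weighting and of the three operators as set operations. The index layout (term mapped to the set of documents containing it) and all function names are assumptions of this example.

```python
def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def weight(index, term, doc_id):
    """Binary weight: 1 if the term is present in the document, 0 if absent."""
    return 1 if doc_id in index.get(term, set()) else 0


def AND(a, b):
    return a & b  # documents in both sets


def OR(a, b):
    return a | b  # documents in either set


def NOT(a, all_docs):
    return all_docs - a  # documents not in the set


docs = {1: "good party", 2: "democratic party", 3: "good music"}
index = build_index(docs)
all_docs = set(docs)

print(weight(index, "party", 1), weight(index, "party", 3))  # 1 0
print(AND(index["party"], NOT(index["democratic"], all_docs)))  # {1}
```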

AND/OR/NOT
[Figure: diagram of sets A, B, and C illustrating the AND, OR, and NOT operators]

Why Boolean Retrieval Works
- Boolean operators approximate natural language
  - Find documents about a good party that is not over
- AND can discover relationships between concepts
  - good party
- OR can discover alternate terminology
  - excellent party, wild party, etc.
- NOT can discover alternate meanings
  - Democratic party

The Perfect Query Paradox
- Every information need has a perfect set of documents
  - If not, there would be no sense doing retrieval
- Every document set has a perfect query
  - AND every word in a document to get a query for it
  - Repeat for each document in the set
  - OR every document query to get the set query
- But can users realistically be expected to formulate this perfect query?
  - Boolean query formulation is hard!
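The construction above (AND within a document, OR across documents) can be written out as a short sketch. The function names are hypothetical, and documents are assumed to be plain sets of words.

```python
def perfect_query(target_docs):
    """Build the query string: AND every word of each document, OR the results."""
    clauses = []
    for words in target_docs:
        clauses.append("(" + " AND ".join(sorted(words)) + ")")
    return " OR ".join(clauses)


def satisfies(doc_words, target_docs):
    """True if a document satisfies the perfect query.

    A document satisfies one AND clause when it contains every word of the
    corresponding target document, so any superset of a target also matches.
    """
    return any(words <= doc_words for words in target_docs)


targets = [{"good", "party"}, {"wild", "party"}]
print(perfect_query(targets))  # (good AND party) OR (party AND wild)
print(satisfies({"good", "party"}, targets))  # True
print(satisfies({"party"}, targets))  # False
```

Even for two tiny documents the query already has four terms; for real documents it would contain every word of every relevant document, which is why no user could be expected to write it.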

Why Boolean Retrieval Fails
- Natural language is way more complex
- AND “discovers” nonexistent relationships
  - Terms in different sentences, paragraphs, …
- Guessing terminology for OR is hard
  - good, nice, excellent, outstanding, awesome, …
- Guessing terms to exclude is even harder!
  - Democratic party, party to a lawsuit, …

BOOLEAN MODEL
Strengths
- Precise, if you know the right strategies
- Precise, if you have an idea of what you’re looking for
- Efficient for the computer
- Simple
Weaknesses
- Users must learn Boolean logic
- Boolean logic is insufficient to capture the richness of language
- No control over size of result set: either too many documents or none
  - When do you stop reading?
  - All documents in the result set are considered “equally good”
- What about partial matches? Documents that “don’t quite match” the query may be useful also
- No notion of ranking (exact matching only)
- All index terms have equal weight

PROBLEMS
- Very rigid: AND means all; OR means any.
- Difficult to express complex user requests.
- Difficult to control the number of documents retrieved.
  - All matched documents will be returned.
- Difficult to rank output.
  - All matched documents logically satisfy the query.
- Difficult to perform relevance feedback.
  - If a document is identified by the user as relevant or irrelevant, how should the query be modified?

ADVANTAGES & DISADVANTAGES
Advantages
- Results are predictable, relatively easy to explain
- Many different features can be incorporated
- Efficient processing since many documents can be eliminated from search
Disadvantages
- Effectiveness depends entirely on the user
- Simple queries usually don’t work well
- Complex queries are difficult

LIMITATIONS
- The first relates to the formulation of search statements. Users are often unable to formulate an exact search statement by combining the AND, OR and NOT operators, especially when several query terms are involved. In such cases the search statement becomes either too narrow or too broad.
- The second limitation relates to the number of retrieved items. Users cannot predict a priori exactly how many items will be retrieved for a given query. If the search statement is broad, the number of retrieved items may run to several hundred, making it difficult to find the exact information required.
- The third limitation is that the model identifies an item as relevant only by checking whether a given query term is present in a given record in the database.
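The "too narrow or too broad" problem is easy to reproduce on a toy collection. The documents and function names below are invented for illustration.

```python
def and_query(docs, terms):
    """Documents containing every query term."""
    return sorted(name for name, words in docs.items() if terms <= words)


def or_query(docs, terms):
    """Documents containing at least one query term."""
    return sorted(name for name, words in docs.items() if terms & words)


docs = {
    "d1": {"good", "party"},
    "d2": {"excellent", "party"},
    "d3": {"party", "lawsuit"},
    "d4": {"good", "music"},
}
terms = {"good", "excellent", "party"}

print(and_query(docs, terms))  # [] : too narrow, nothing is returned
print(or_query(docs, terms))   # ['d1', 'd2', 'd3', 'd4'] : too broad, everything is returned
```

Neither result tells the user which documents are closest to the information need, and the size of each result set could not have been predicted from the query alone.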