2
Keyword-based querying
•A query is an expression of users information
needs.
–IR queries are distinctive in that they are unstructured
and often ambiguous; they differ from standard query
languages which are governed by strict syntax rules.
•Queries are combinations of words.
–The document collection is searched for documents that
contain these words.
•Word queries are intuitive, easy to express and
provide fast ranking.
3
Single-word queries
•A query is a single word
–Usually used for searching in document images
•Simplest form of query.
•What are the possible documents retrieved as
relevant?
–All documents that include the query word are
retrieved.
•On what base documents are ranked?
–Documents may be ranked by the frequency count of
the query word in the document.
–Documents containing more of the query word are
given the highest priority
4
Phrase queries
•A query is a sequence of words treated as a single unit.
Also called “literal string” or “exact phrase” query.
–Phrase is usually surrounded by quotation marks.
–All documents that include this phrase are retrieved.
•Usually, separators (commas, colons, ...) & common words
(“a”, “the”, “of”, “for”…) in the phrase are ignored
•In effect, this query is for a set of words that must appear as
per the given sequence.
–Allows users to specify a context and thus gain precision.
–Ex.: “Information Processing for Document Retrieval”.
•What are the possible documents retrieved as relevant?
–All documents that include phrase query are retrieved.
•On what base documents are ranked?
–Documents may be ranked by the frequency of appearance of
the phrase query in the document.
5
Multiple-word queries
•A query is a set of words (or a set of phrases).
–Ex.: what is the result for the query “Data Mining and Intelligent
Database Design”?
•What are the possible documents retrieved as relevant?
–A document is retrieved if it includes one or more of the
query words.
•On what bases documents be ranked to list according to
best matching principle?
–Documents are ranked by the number of query words they
contain. A document containing n query terms is ranked
higher than a document containing m < n query words.
– Documents are ranked in decreasing order:
•those containing all the query words are ranked at the top, only one
query word at bottom.
–Frequency counts may be used to break ties among
documents that contain the same query words.
6
Boolean queries
•Queries are formulated based on concepts from logic:
AND, OR, NOT
–It describes the information needed by relating multiple words
with Boolean operators.
•Semantics: For each query word w a corresponding set D
w
is constructed that includes the documents that contain w.
•The Boolean expression is then interpreted as an
expression on the corresponding document sets with
corresponding set operators:
–AND: Finds only documents containing all of the specified
words or phrases.
–OR: Finds documents containing at least one of the specified
words or phrases.
–NOT: Excludes documents containing the specified word or
phrase.
7
Examples: Boolean queries
1.computer OR server
–Finds documents containing either computer, server or both
2. (computer OR server) NOT mainframe
–Select all documents that discuss computers or servers, do
not select any documents that discuss mainframes.
3. computer OR server NOT mainframe
–Select all documents that discuss computers, or documents
that discuss servers but do not discuss mainframes.
4. computer NOT (server OR mainframe)
–Select all documents that discuss computers, and do not
discuss either servers or mainframes.
8
Weighted queries
•Each of the words is assigned a different weight,
expressing the relative importance of the word within
the query.
•A query is then a set of word-weight pairs:
(q
1, w
1), (q
2, w
2), …, (q
n, w
n).
•The ranking of a document is the sum of the weights
for the query words that it satisfies.
• Example: given Query: (A,0.8), (B,0.9), (C,0.3); and
Document 1: (A, B, D) & Document 2: (A, C, D) which
document ranked first ?
–Score of Document 1: 0.8+0.9 = 1.7
–Score of Document 2: 0.8+0.3 = 1.1
–Each document includes two words from the query, but
Document1 is ranked higher because it includes more
important words.
9
Natural language
•Using natural language for querying is very
attractive.
•Example: Find all the documents that discuss
“campaign finance reforms, including documents that
discuss violations of campaign financing regulations. Do
not include documents that discuss campaign
contributions by the gun and the tobacco industries”.
•Natural language queries are converted to a formal
language for processing against a set of
documents.
•Such translation requires intelligence and is still a
challenge
Problems with Keywords
May not retrieve relevant documents that
include synonymy terms.
◦“restaurant” vs. “café”
May retrieve irrelevant documents that
include polysomy terms.
◦“Apple” (company vs. fruit vs. sport club)
◦“bit” (unit of data vs. act of eating)
◦“bat” (baseball vs. mammal)
13
Introduction
No detailed knowledge of collection and searching
environment
édifficult to formulate queries well designed for searching
éneed many formulations of queries for effective searching
First formulation: often inexperienced attempt to
retrieve relevant information
Documents initially retrieved:
écan be examined for relevance information (by the user or
automatically by the system) to provide relevance
feedback
é improve query formulations for retrieving additional
relevant documents (using query reformulation
techniques)
14
Query Reformulation
•Identify terms related to query terms
•Revise query to account for feedback using two
basic techniques:
–Query Expansion: Add new terms related to query terms
from relevant documents.
–Term Reweighting: modify term weights based on
documents relevance for the users query
•Increase weight of terms in relevant documents and decrease
weight of terms in irrelevant documents such that:
–Reformulated query: Closer to term weight vectors of
relevant documents
15
Approaches for Relevance Feedback
•Users relevance feedback
Most popular query reformulation strategy.
Approaches based on feedback from users about
relevance of documents retrieved
•Pseudo-relevance feedback
Approaches based on information derived from set
of initially retrieved documents (local set of
documents), which is called Local Analysis
Approaches based on global information derived
from document collection, which is called Global
Analysis
16
Users Relevance Feedback
•After initial searching results are presented, allow users to
provide feedback on the relevance of one or more of the
retrieved documents.
•Cycle:
–User presented with list of retrieved documents
–User marks those which are relevant
•In practice: top 10-20 ranked documents are examined
–Use this feedback information to reformulate the query, i.e.
•To select important terms from documents assessed
relevant by the users & add these terms to the reformulated
new query
–Produce new results based on reformulated query.
–Allows more interactive, multi-pass process.
•Expected: New query moves towards relevant documents
and away from non-relevant documents
18
Pseudo Relevance Feedback
•Use relevance feedback methods without explicit user
input.
•Steps
–Obtain relevance feedback automatically.
•Just assume the top K retrieved documents are relevant, &
use them to reformulate the query.
–Allow for query expansion that includes terms from
documents that are correlated with the query terms.
•Identify terms related to query terms (such as synonyms terms
and/or similar terms that are close to query terms in text)
•Two strategies
–Local strategies
–Global strategies
20
Local Analysis
•Examine only documents retrieved automatically for
query to determine query expansion
–Base correlation analysis on only the “local” set of
retrieved documents for a specific query.
•At query time, dynamically determine similar terms
based on analysis of K top-ranked retrieved
documents.
•Avoids ambiguity by determining similar (related)
terms only within relevant documents.
–“Apple computer” “Apple computer Powerbook
laptop”
21
Global analysis
•Expand query using information from whole set of
documents in collection
–Determine term similarity through a pre-computed statistical
analysis of the complete corpus.
•Thesaurus-like structure using all documents
–Approach to automatically built thesaurus
For example: similarity thesaurus based on co-occurrence
frequency
•A thesaurus provides information on synonyms and
semantically related words and phrases.
•Example:
physician
similar/synonymous: doctor, medical, MD
related: general practitioner, surgeon