unit 1 INTRODUCTION

karthiksmart21 17 views 42 slides Sep 27, 2024

About This Presentation

• To understand the basics of Information Retrieval


Slide Content

UNIT I
INTRODUCTION
1. INTRODUCTION
What is Information?
There is no “correct” definition
Cookie Monster’s definition:
o “news or facts about something”
Information:
Informing, telling; thing told, knowledge, items of knowledge, news
Knowledge communicated or received concerning a particular fact or circumstance;
news
Knowledge: knowing familiarity gained by experience; person’s range of information; a
theoretical or practical understanding of; the sum of what is known
Three Views of Information
o Information as process
o Information as communication
o Information as message transmission and reception
Types of information
o Text (Documents and portions thereof)
o XML and structured documents
o Images
o Audio (sound effects, songs, etc.)
o Video
o Source code
o Applications/Web services
Retrieval?
“Fetch something” that’s been stored
o Recover a stored state of knowledge
Search through stored messages to find some messages relevant to the task at hand.
What is IR?
Information retrieval is a problem-oriented discipline, concerned with the problem of
the effective and efficient transfer of desired information between human generator and human
user.

According to Calvin Mooers's (1951) definition:
“Information retrieval (IR) embraces the intellectual aspects of the description of information
and its specification for search, and also whatever systems, techniques, or machines are
employed to carry out the operation.”
o Information retrieval is the activity of obtaining information resources relevant to an
information need from a collection of information resources.
o An information retrieval process begins when a user enters a query into the system.
Queries are formal statements of information needs.
o User queries are matched against the database information. Depending on the
application, the data objects may be, for example, text documents, images, audio, mind
maps or videos.
o Most IR systems compute a numeric score for how well each object in the database
matches the query, and rank the objects according to this value.
o The top-ranking objects are then shown to the user. The process may then be iterated if
the user wishes to refine the query.
Main Objective of IR:
Provide the users with effective access to & interaction with information resources.
o Goal of IR is to retrieve all and only the “relevant” documents in a collection for a
particular user with a particular need for information
Relevance is a central concept in IR theory
How does an IR system work when the “collection” is all documents available on the Web?
Web search engines have been stress-testing the traditional IR models (and inventing
new ways of ranking)
o The goal is to search large document collections (millions of documents) to retrieve
small subsets relevant to the user’s information need
o Examples are:
   o Internet search engines (Google, Yahoo! web search, etc.)
   o Digital library catalogues (MELVYL, GLADYS)
What do we want from an IR system (IRS)?
o Systemic approach
Goal (for a known information need): Return as many relevant documents as
possible and as few non-relevant documents as possible
o Cognitive approach
Goal (in an interactive information-seeking environment, with a given IRS):
Support the user’s exploration of the problem domain and the task
completion.

The role of an IR system
o Support the user in
exploring a problem domain, understanding its terminology, concepts and
structure
clarifying, refining and formulating an information need
finding documents that match the info need description
As many relevant docs as possible
As few non-relevant documents as possible
Some application areas within IR
o Cross-language retrieval
o Speech/broadcast retrieval
o Text categorization
o Text summarization
o Structured document element retrieval (XML)
Information Retrieval vs. Information Extraction
Information Retrieval
Given a set of query terms and a set of document terms, select only the most relevant documents
(precision), and preferably all the relevant ones (recall)
Information Extraction
Extract from the text what the document means.
Databases vs. IR
                          Databases                      IR
What we are retrieving    Structured data                Mostly unstructured
Queries we are posing     Formally defined queries,      Expressed in natural
                          unambiguous                    language
Results we get            Exact. Always correct in a     Sometimes relevant,
                          formal sense.                  often not
Interaction with system   One-shot queries               Interaction is important

Performance and correctness measures
Precision
Precision is the fraction of the documents retrieved that are relevant to the user’s information need.
Recall
Recall is the fraction of the documents that are relevant to the query that are successfully
retrieved
Fall-out
The proportion of non-relevant documents that are retrieved, out of all non-relevant
documents available
F-score / F-measure
The F-measure is the weighted harmonic mean of precision and recall. The traditional F-measure,
or balanced F-score, weights both equally:
F = 2 · (precision · recall) / (precision + recall)
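The four measures above can be computed directly from the set of retrieved documents and the set of relevant documents. The following is a minimal sketch; the function name `retrieval_metrics`, the document identifiers, and the collection size are assumptions made for illustration, not part of the original slides.

```python
def retrieval_metrics(retrieved, relevant, collection_size):
    """Return (precision, recall, fall-out, balanced F-score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents that were retrieved
    non_relevant_total = collection_size - len(relevant)  # all non-relevant documents available
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    fall_out = (len(retrieved) - hits) / non_relevant_total if non_relevant_total else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fall_out, f_score


# Hypothetical query: 4 documents retrieved, 6 relevant documents exist,
# and the collection holds 20 documents in total.
p, r, fo, f = retrieval_metrics(
    ["d1", "d2", "d3", "d7"],
    ["d1", "d2", "d3", "d4", "d5", "d6"],
    collection_size=20,
)
print(p, r, fo, f)
```

In this toy case three of the four retrieved documents are relevant, so precision is 0.75, recall is 0.5, and the balanced F-score is 0.6.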
1.1 Information Retrieval:
IR deals with the representation, storage, organization of, and access to information items
Types of information items: documents, Web pages, online catalogs, structured records,
multimedia objects
Early goals of the IR area: indexing text and searching for useful documents in a collection
Nowadays, research in IR includes: Modeling, Web search, text classification, systems
architecture, user interfaces, data visualization, filtering and languages.

1.2. Early Developments:
For more than 5,000 years, man has organized information for later retrieval and searching
This has been done by compiling, storing, organizing, and indexing papyrus, hieroglyphics, and
books
To hold the various items, special-purpose buildings called libraries, or bibliothekes, are used
-The oldest known library was created in Ebla, in the Fertile Crescent, between 3,000 and
2,500 BC
-By 300 BC, Ptolemy Soter, a Macedonian general, created the Great Library at Alexandria
-Nowadays, libraries are everywhere
In 2008, more than 2 billion items were checked out from libraries in the US—an increase of 10%
over the previous year
Since the volume of information in libraries is always growing, it is necessary to build specialized
data structures for fast search — the indexes
For centuries indexes have been created manually as sets of categories, with labels associated with
each category
The advent of modern computers has allowed the construction of large indexes automatically
During the 1950s, research efforts in IR were initiated by pioneers such as Hans Peter Luhn, Eugene
Garfield, Philip Bagley, and Calvin Mooers, who allegedly coined the term Information Retrieval
In 1962, Cyril Cleverdon published the Cranfield studies on retrieval evaluation
In 1963, Joseph Becker and Robert Hayes published the first book on IR
In the late 60’s, key research conducted by Karen Sparck Jones and Gerard Salton, among others,
led to the definition of the TF-IDF term weighting scheme
In 1971, Jardine and van Rijsbergen articulated the cluster hypothesis
In 1978, the first ACM SIGIR International Conference on Information Retrieval was held in
Rochester
In 1979, van Rijsbergen published a classic book entitled Information Retrieval, which focused on
the Probabilistic Model
In 1983, Salton and McGill published a classic book entitled Introduction to Modern Information
Retrieval, which focused on the Vector Model

1.3 The IR Problem:
Users of modern IR systems, such as search engine users, have information needs of varying
complexity
An example of complex information need is as follows:
o “Find all documents that address the role of the Federal Government in financing the
operation of the National Railroad Transportation Corporation (AMTRAK)”
This full description of the user information need is not necessarily a good query to be submitted
to the IR system
Instead, the user might want to first translate this information need into a query
This translation process yields a set of keywords, or index terms, which summarize the user
information need
Given the user query, the key goal of the IR system is to retrieve information that is useful or
relevant to the user
That is, the IR system must rank the information items according to a degree of relevance to the
user query
The IR Problem
o “The key goal of an IR system is to retrieve all the items that are relevant to a user query,
while retrieving as few non relevant items as possible”.
The notion of relevance is of central importance in IR
1.4 The User's Task
Consider a user who seeks information on a topic of their interest
o This user first translates their information need into a query, which requires specifying
the words that compose the query
o In this case, we say that the user is searching or querying for information of their
interest
Consider now a user who has an interest that is either poorly defined or inherently broad
o For instance, the user has an interest in car racing and wants to browse documents on Formula
1 and Formula Indy
o In this case, we say that the user is browsing or navigating the documents of the collection

The User Task:
The user is first supposed to translate the information need into a query. In an information
retrieval system, the query is a set of words that conveys the semantics of the information required,
whereas in a data retrieval system, a query expression conveys the constraints that must be
satisfied by the objects.
Example: A user starts out searching for one thing but ends up looking at something else. This
means that the user is browsing and not searching. The above figure shows the interaction of the user
through different tasks.
Logical View of the Documents: Historically, documents were represented through a set of
index terms or keywords. Modern computers make it possible to represent a document by its full
set of words. The set of representative keywords can then be reduced by eliminating stopwords, i.e.
articles and connectives. Such operations are called text operations. They reduce the
complexity of the document representation from full text to a set of index terms.
1.5 Information Retrieval vs. Data Retrieval:
Information Retrieval: Given a set of query terms and a set of document terms select only
the most relevant documents [precision], and preferably all the relevant [recall].
Data retrieval: the task of determining which documents of a collection contain the
keywords in the user query
Data retrieval system
o Ex: relational databases
o Deals with data that has a well-defined structure and semantics
o A single erroneous object among a thousand retrieved objects means total failure
Data retrieval does not solve the problem of retrieving information about a subject or
topic

Information Retrieval                               Data Retrieval

Software that deals with the organization,          Obtains data from a database management system
storage, retrieval, and evaluation of               such as an ODBMS. It is the process of identifying
information from document repositories,             and retrieving data from the database, based on
particularly textual information.                   the query provided by the user or application.

Retrieves information about a subject.              Determines the keywords in the user query and
                                                    retrieves the data.

Small errors are likely to go unnoticed.            A single erroneous object means total failure.

Not always well structured and is semantically      Has a well-defined structure and semantics.
ambiguous.

Does not provide a solution to the user of the      Provides solutions to the user of the database
database system.                                    system.

The results obtained are approximate matches.       The results obtained are exact matches.

Results are ordered by relevance.                   Results are not ordered by relevance.

It is a probabilistic model.                        It is a deterministic model.
1.6 The IR System
The components of an IR system are:
1. Document subsystem
a) Acquisition
b) Representation
c) File organization
2. User subsystem
a) Problem
b) Representation
c) Query
3. Searching/Retrieval subsystem
a) Matching
b) Retrieved objects
An information retrieval system thus has three major components: the document subsystem, the
user subsystem, and the searching/retrieval subsystem.
These divisions are quite broad and each one is designed to serve one or more functions, such
as:
Analysis of documents and organization of information(creation of a
document database)
Analysis of the user’s queries and preparation of a strategy to search the database
Actual searching or matching of the user’s queries with the database
Retrieval of items that fully or partially match the search statement.
Traditional IR System
Acquisition (Document subsystem)
Selection of documents & other objects from various web resources
Mostly text based documents
*full texts, titles, abstracts ...
*but also other objects:
data, statistics, images, maps, trade marks, sounds ...
The data are collected by a web crawler and stored in a database.
Representation of documents, objects(document subsystem)
Indexing – many ways :
*free text terms (even in full texts)
*controlled vocabulary - thesaurus
*manual & automatic techniques
Abstracting; summarizing
Bibliographic description:
*author, title, sources, date…
*metadata
Classifying, clustering

Organizing in fields & limits
*Basic Index, Additional Index, Limits
File organization (Document subsystem)
Sequential
*record (document) by record
Inverted
*term by term; list of records under each term
Combination
indexes inverted, documents sequential
When only citations are retrieved, separate document files are needed
Large file approaches for efficient retrieval by computers
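The difference between the sequential and inverted organizations above can be sketched as follows. This is a toy illustration under stated assumptions: the three records, the names `sequential_file`, `build_inverted_file`, and `sequential_scan` are hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Sequential file: records (documents) stored one after another.
sequential_file = [
    (1, "boolean retrieval model"),
    (2, "vector model for ranked retrieval"),
    (3, "probabilistic model"),
]


def build_inverted_file(records):
    """Inverted file: for each term, the list of records that contain it."""
    inverted = defaultdict(list)
    for doc_id, text in records:
        for term in sorted(set(text.split())):
            inverted[term].append(doc_id)
    return dict(inverted)


def sequential_scan(records, term):
    """Answer the same question by reading every record in turn."""
    return [doc_id for doc_id, text in records if term in text.split()]


inverted_file = build_inverted_file(sequential_file)
print(inverted_file["retrieval"])                     # one dictionary lookup
print(sequential_scan(sequential_file, "retrieval"))  # full pass over the file
```

Both calls return the same records; the inverted file answers with a single lookup, while the scan touches every record. The "combination" layout keeps the inverted lists for searching and the sequential records for displaying the documents.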
Problem (user subsystem)
Related to user’s task, situation
*vary in specificity, clarity
Produces information need
*ultimate criterion for effectiveness of retrieval
how well was the need met?
Information need for the same problem may change, evolve, shift during the IR process
-adjustment in searching
*often more than one search for same problem over time
you will experience this in your term project
Representation (user subsystem)
Converting a concept to query.
What we search for.
The query terms are stemmed and corrected using a dictionary.
Focus toward a good result
Subject to feedback changes
Query - search statement (user & system)
Translation into systems requirements & limits
*start of human-computer interaction
query is the thing that goes into the computer

Selection of files, resources
Search strategy - selection of:
*search terms & logic
*possible fields, delimiters
*controlled & uncontrolled vocabulary
*variations in effectiveness tactics
Reiterations from feedback
*several feedback types: relevance feedback, magnitude feedback..
*query expansion & modification
Matching - searching (Searching subsystem)
Process of matching, comparing
*search: what documents in the file match the query as stated?
Various search algorithms:
*exact match - Boolean
still available in most, if not all systems
*best match - ranking by relevance
increasingly used e.g. on the web
*hybrids incorporating both
e.g. Target, Rank in DIALOG
Each has strengths, weaknesses
*no ‘perfect’ method exists
and probably never will
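To make the contrast between exact match and best match concrete, here is a small sketch over a hand-made index. The index contents and the function names `boolean_and`, `boolean_or`, and `best_match` are assumptions for illustration; the best-match score used here (number of query terms a document contains) is a deliberately simple stand-in for a real relevance measure.

```python
# Hypothetical inverted index: term -> set of document ids containing it.
index = {
    "jaguar": {1, 2, 4},
    "car": {2, 3, 4},
    "speed": {4, 5},
}


def boolean_and(index, terms):
    """Exact match: documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Exact match: documents that contain at least one query term."""
    return set().union(*(index.get(t, set()) for t in terms))


def best_match(index, terms):
    """Best match: rank documents by how many query terms they contain."""
    counts = {}
    for t in terms:
        for doc in index.get(t, set()):
            counts[doc] = counts.get(doc, 0) + 1
    return sorted(counts, key=lambda d: (-counts[d], d))


query = ["jaguar", "car", "speed"]
print(boolean_and(index, query))  # only the document matching all terms
print(boolean_or(index, query))   # every document matching any term, unordered
print(best_match(index, query))   # the same documents, ordered by score
```

The AND query is precise but returns a single document, the OR query returns everything with no ordering, and the ranked list sits between the two, which is one way to see why neither approach is "perfect".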
Retrieved documents -from system to user (IR Subsystem)
Various order of output:
*Last In First Out (LIFO); sorted
*ranked by relevance
*ranked by other characteristics
Various forms of output
When citations only: possible links to document delivery
Base for relevance, utility evaluation by users

Relevance feedback
High level software architecture of an IR system
1.7. The Software Architecture of the IR System

1.8. The Retrieval and Ranking Processes
The processes of indexing, retrieval, and ranking
Text operations form the index words (tokens).
o Tokenization – Given a character sequence and a defined document unit, tokenization is the
task of chopping it up into pieces called tokens.
o Stopword removal – Remove non-informative or common words (tokens) from the stream, e.g.
is, was, and, it, a, etc.
o Stemming – Replace word variants with a single stem. E.g. education,
educated, and educate are all replaced with the single stem educate.
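The three text operations above can be chained into a small pipeline. This is a rough sketch only: the stopword list is a tiny illustrative sample, and `naive_stem` uses a few hypothetical suffix rules rather than a real stemming algorithm such as Porter's, so the stem it produces ("educat") is spelled differently from the one quoted in the text. What matters is that all variants map to the same string.

```python
import re

STOPWORDS = {"is", "was", "and", "it", "a", "to", "the"}  # tiny illustrative list
SUFFIXES = ("ion", "ed", "e")  # hypothetical rules, not the Porter stemmer


def tokenize(text):
    """Chop a character sequence into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop common, non-informative tokens."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


terms = text_operations("It is educated, and education was a way to educate.")
print(terms)
```

With this input, "educated", "education", and "educate" all collapse to one index term, and the stopwords disappear from the stream.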

Indexing : Documents are converted into fast searchable internal representation using language
independent data structure called Inverted Index.
Searching : Calculate degree of similarity between document and query terms; retrieves
documents that contain a given query token from the inverted index.
Ranking: Scores all retrieved documents according to a relevance metric (e.g. term frequency or
cosine similarity)
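As one possible reading of the ranking step, the sketch below scores documents by the cosine similarity between raw term-frequency vectors of the query and of each document. The three documents and the names `tf_vector`, `cosine`, and `rank` are assumptions for illustration; real systems usually add weighting such as TF-IDF and read candidates from the inverted index instead of scoring every document.

```python
import math
from collections import Counter

docs = {  # hypothetical toy collection
    "d1": "web search engines rank web pages",
    "d2": "libraries index books",
    "d3": "search engines index pages",
}


def tf_vector(text):
    """Term-frequency vector: term -> number of occurrences."""
    return Counter(text.split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return (score, doc_id) pairs with non-zero score, best first."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))


results = rank("web search", docs)
for score, doc_id in results:
    print(doc_id, round(score, 3))
```

Document d1 contains both query terms (one of them twice) and comes first; d2 shares no term with the query and is not returned at all.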
User Interface manages interaction with the user:
–Query input and document output.
–Relevance feedback.
–Visualization of results.
Query Operations transform the query to improve retrieval:
–Query expansion using a thesaurus (vocabulary/terms). A thesaurus is a data structure that
defines semantic relatedness between words, e.g. car, auto, automobile and vehicle are
semantically related words
–Query transformation using relevance feedback (the user gives feedback on the
relevance of the documents in an initial set of results)
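Thesaurus-based query expansion can be sketched with a plain dictionary. The thesaurus entry reuses the related words mentioned above; the function name `expand_query` and the sample query are hypothetical.

```python
# Hypothetical thesaurus: term -> semantically related terms.
THESAURUS = {
    "car": ["auto", "automobile", "vehicle"],
}


def expand_query(terms, thesaurus):
    """Add related terms after each query term, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query(["cheap", "car"], THESAURUS)
print(expanded)
```

The expanded query can match documents that say "automobile" but never "car", which tends to raise recall at some cost in precision.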
Steps in Performing Information Retrieval
We can identify four distinct steps that a typical IR system must follow in order to
fulfil its task. These are:
1. Document Gathering
This is the process of gathering the documents that are to form the core content of the IR system.
These documents could be text, images, audio files, video clips, entire movies, etc. If working with
a fixed and readily available set of documents, then this is simply a process of knowing the
location of each file on disk and gathering them before converting them into a searchable internal
representation (document indexing).
The gathered documents may be preprocessed before indexing. For example:
• Unnecessary mark-up of text may be removed.
• Many frequently occurring words that are of no benefit to the automatic retrieval
process may be removed. These words are called stopwords
• Terms within documents may be truncated to term stems (stemming).

2. Document Indexing
The documents gathered in the document gathering phase are converted into a fast searchable
internal representation.
This will usually be implemented using some programming language dependent data structures
which provide fast searching facilities such as array lists, vectors, sets, multi-sets, maps. Non-text
documents such as images, audio files, video clips will be indexed using some features which
support user searching.
3. Searching Support
This process involves accepting a query, processing it, finding possibly relevant documents,
calculating the degree of similarity between the query and each possibly relevant document,
sorting the documents by that score, and returning the highest-ranked ones to the user in
groups of (usually) 10. All this has to be done as efficiently and quickly as possible. For example,
the IR system that operates as the Google search engine accepts and processes
150 million queries per day, which is roughly
6.25 million per hour,
104,000 per minute,
1,700 per second.
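The hourly, per-minute, and per-second rates follow from the daily figure by plain division, as this short worked calculation shows. The only input is the 150 million queries per day quoted above; the variable names are ours.

```python
queries_per_day = 150_000_000

per_hour = queries_per_day / 24   # 24 hours in a day
per_minute = per_hour / 60        # 60 minutes in an hour
per_second = per_minute / 60      # 60 seconds in a minute

print(f"{per_hour:,.0f} per hour")
print(f"{per_minute:,.0f} per minute")
print(f"{per_second:,.0f} per second")
```

The exact quotients are 6,250,000 per hour, about 104,167 per minute, and about 1,736 per second, so the figures in the text are rounded values.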
4. Document Management
In the previous three steps, we have gathered documents, indexed them and are now allowing users
to search their content. However, in many scenarios such as web searching, the documents that
have been indexed will be unstable and constantly changing.
Consequently, we must validate that:
The documents that comprise the internal representation of the document collection are as
up-to-date as possible.
The documents included in the internal representation are actually still in existence.
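One simple way to perform the two checks above is to store a fingerprint of each document at indexing time and compare it with the document as it exists now. The sketch below assumes exactly that; the URLs, the contents, and the names `fingerprint` and `validate_index` are hypothetical, and a real system would fetch the current content instead of reading it from a dictionary.

```python
import hashlib


def fingerprint(content):
    """Hash of the document text, recorded when the document is indexed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def validate_index(indexed, current):
    """Compare indexed fingerprints with the documents as they are now.

    indexed: {url: fingerprint at indexing time}
    current: {url: content now}; a missing url means the document is gone.
    """
    report = {"up_to_date": [], "stale": [], "missing": []}
    for url, old_fp in sorted(indexed.items()):
        if url not in current:
            report["missing"].append(url)      # no longer in existence
        elif fingerprint(current[url]) != old_fp:
            report["stale"].append(url)        # changed since it was indexed
        else:
            report["up_to_date"].append(url)
    return report


indexed = {
    "https://example.org/a": fingerprint("old text"),
    "https://example.org/b": fingerprint("same text"),
    "https://example.org/c": fingerprint("gone"),
}
current = {
    "https://example.org/a": "new text",
    "https://example.org/b": "same text",
}
report = validate_index(indexed, current)
print(report)
```

Stale documents would be re-indexed and missing ones dropped from the internal representation.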
Dimensions of IR
IR is more than just text, and more than just web search
although these are central
People doing IR work with different media, different types of search applications, and different
tasks

World Wide Web (web search) is the most common application involving information retrieval; search
is also a crucial part of applications in corporations, government, and many other domains.
Vertical Search is a specialized form of web search where the domain of the search is restricted to a
particular topic.
Enterprise Search involves finding the required information in the huge variety of computer files
scattered across a corporate intranet.
Desktop Search is the personal version of enterprise search, where the information sources are the files
stored on an individual computer, including email messages and web pages that have recently been
browsed.
Peer-to-peer search involves finding information in networks of nodes or computers without any
centralized control.
IR Tasks
oAd-hoc search- Find relevant documents for an arbitrary text query
oFiltering- Identify relevant user profiles for a new document
oClassification - Identify relevant labels for documents
oQuestion answering- Give a specific answer to a question
1.9. THE WEB:
At the end of World War II, Vannevar Bush looked for peacetime applications of new
technologies
Bush first produced a report entitled Science, The Endless Frontier
This report directly influenced the creation of the National Science Foundation
Following this, he wrote As We May Think, a remarkable paper which discussed new hardware and
software gadgets
In Bush’s words: “Whole new forms of encyclopedias will appear, ready-made with a mesh of
associative trails running through them, ready to be dropped into the memex and there amplified”
As We May Think influenced people like Douglas Engelbart, who invented the computer mouse
and introduced the concept of hyperlinked texts
Ted Nelson, working in his Project Xanadu, pushed the concept further and coined the term
hypertext
A hypertext allows the reader to jump from one electronic document to another, which was one
important property regarding the problem that Tim Berners-Lee faced in 1989
At the time, Berners-Lee worked in Geneva at the CERN—Conseil Européen pour la Recherche
Nucléaire
There, researchers who wanted to share documentation with others had to reformat their
documents to make them compatible with an internal publishing system
Berners-Lee reasoned that it would be nice if the solution of sharing documents were decentralized
He saw that a networked hypertext would be a good solution and started working on its
implementation
In 1990, Berners-Lee
Wrote the HTTP protocol
Defined the HTML language
Wrote the first browser, which he called World Wide Web
Wrote the first Web server
In 1991, he made his browser and server software available on the Internet
The Web was born!
1.10. The E-Publishing Era
Since its inception, the Web has been a huge success
Well over 20 billion pages are now available and accessible on the Web
More than one fourth of humanity now accesses the Web on a regular basis
Why is the Web such a success? What is the single most important characteristic of the
Web that makes it so revolutionary?
In search of an answer, let us delve into the life of a writer who lived at the end of the
18th century
She finished the first draft of her novel in 1796
The first attempt of publication was refused without a reading
The novel was only published 15 years later!
She got a flat fee of £110, which meant that she was not paid anything for the many
subsequent editions
Further, her authorship was anonymized under the reference “By a Lady”
Pride and Prejudice is the second or third best loved novel in the UK ever, after The
Lord of the Rings and Harry Potter
It has been the subject of six TV series and five film versions
The last of these, starring Keira Knightley and Matthew Macfadyen, grossed over
100 million dollars
Jane Austen published anonymously her entire life
Throughout the 20th century, her novels were never out of print
Jane Austen was discriminated against because there was no freedom to publish at the beginning of the
19th century
The Web, unleashed by the inventiveness of Tim Berners-Lee, changed this once and for all
It did so by universalizing freedom to publish
The Web moved mankind into a new era, into a new time, into The e-Publishing Era
1.11. How the Web Changed Search
Web search is today the most prominent application of IR and its techniques—the ranking and
indexing components of any search engine are fundamentally IR pieces of technology.
The first major impact of the Web on search is related to the characteristics of the document
collection itself.
The Web is composed of pages distributed over millions of sites and connected through
hyperlinks.
This requires collecting all documents and storing copies of them in a central repository,
prior to indexing.
This new phase in the IR process, introduced by the Web, is called crawling
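The crawling phase can be sketched as a breadth-first traversal of hyperlinks that copies each page into a central repository. To stay self-contained, the sketch crawls a hypothetical in-memory "web" (the `WEB` dictionary) instead of the network; the names `crawl`, `fetch`, and `max_pages` are assumptions, and real crawlers also handle politeness, robots exclusion, and failures.

```python
from collections import deque

# Hypothetical toy web: url -> (page text, outgoing hyperlinks).
WEB = {
    "a.example/home": ("welcome", ["a.example/about", "b.example/index"]),
    "a.example/about": ("about us", ["a.example/home"]),
    "b.example/index": ("b site", ["c.example/page"]),
    "c.example/page": ("c page", []),
    "d.example/orphan": ("never linked", []),
}


def crawl(seeds, fetch, max_pages=100):
    """Follow hyperlinks from the seed pages and store a copy of each page."""
    repository = {}          # central repository: url -> stored copy
    frontier = deque(seeds)  # urls waiting to be fetched
    seen = set(seeds)        # urls already queued, to avoid loops
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:     # page could not be fetched
            continue
        text, links = page
        repository[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository


repository = crawl(["a.example/home"], WEB.get)
print(list(repository))
```

The orphan page is never collected because no hyperlink leads to it, which illustrates why the crawler can only index what it can reach from its seeds.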

The second major impact of the Web on search is related to:
The size of the collection
The volume of user queries submitted on a daily basis
As a consequence, performance and scalability have become critical characteristics
of the IR system
The third major impact : in a very large collection, predicting relevance is much harder
than before
Fortunately, the Web also includes new sources of evidence
Ex: hyperlinks and user clicks in documents in the answer set
The fourth major impact derives from the fact that the Web is also a medium to do
business
Search problem has been extended beyond the seeking of text information to also
encompass other user needs
Ex: the price of a book, the phone number of a hotel, the link for downloading a piece of software
The fifth major impact of the Web on search is Web spam
Web spam: abusive availability of commercial information disguised in the form of
informational content
This difficulty is so large that today we talk of Adversarial Web Retrieval
1.12. Practical Issues on the Web
1. Security
Commercial transactions over the Internet are not yet a completely safe procedure
2. Privacy
Frequently, people are willing to exchange information as long as it does not
become public
3. Copyright and patent rights
It is far from clear how the wide spread of data on the Web affects copyright and
patent laws in the various countries
4. Scanning, optical character recognition (OCR), and cross-language retrieval

1.13. How People Search
Search tasks range from the relatively simple (e.g., looking up disputed facts or finding
weather information) to the rich and complex (e.g., job seeking and planning vacations).
Search interfaces should support a range of tasks, while taking into account how people think
about searching for information.
1.13.1. Information Lookup versus Exploratory Search
User interaction with search interfaces differs depending on
the type of task
the domain expertise of the information seeker
the amount of time and effort available to invest in the process
Marchionini makes a distinction between information lookup and exploratory search
Information lookup tasks
are akin to fact retrieval or question answering
can be satisfied by discrete pieces of information: numbers, dates, names, or Web
sites
can work well for standard Web search interactions
Exploratory search is divided into learning and investigating tasks
Learning search
i) requires more than single query-response pairs
ii) requires the searcher to spend time
scanning and reading multiple information items
synthesizing content to form new understanding
Investigating refers to a longer-term process which
involves multiple iterations that take place over perhaps very long periods of time
may return results that are critically assessed before being integrated into personal
and professional knowledge bases
may be concerned with finding a large proportion of the relevant information available
Information seeking can be seen as being part of a larger process referred to as
sensemaking
Sensemaking is an iterative process of formulating a conceptual representation from a large
collection

Russell et al. observe that most of the effort in sensemaking goes towards the synthesis of a good
representation
Some sensemaking activities interweave search throughout, while others consist of doing a
batch of search followed by a batch of analysis and synthesis.
Examples of deep analysis tasks that require sensemaking (in addition to search)
the legal discovery process
epidemiology(disease tracking)
studying customer complaints to improve service
Obtaining business intelligence.
1.13.2. The Classic versus the Dynamic Model of Information Seeking
Classic notion of the information seeking process:
1.problem identification
2.articulation of information need(s)
3.query formulation
4.results evaluation
More recent models emphasize the dynamic nature of the search process
The users learn as they search
Their information needs adjust as they see retrieval results and other document
surrogates
This dynamic process is sometimes referred to as the berry picking model of search
The rapid response times of today’s Web search engines allow searchers:
to look at the results that come back
to reformulate their query based on these results
This kind of behavior is a commonly-observed strategy within the berry-picking approach
Sometimes it is referred to as orienteering
Jansen et al. analyzed search logs and found that the proportion of users who
modified queries is 52%
Some seeking models cast the process in terms of strategies and how choices for next steps are
made
In some cases, these models are meant to reflect conscious planning behavior by
expert searchers
In others, the models are meant to capture the less planned, potentially more reactive
behavior of a typical information seeker
1.13.3. Navigation versus Search
Navigation: the searcher looks at an information structure and browses among the
available information
This browsing strategy is preferable when the information structure is well-matched to the
user’s information need
it is mentally less taxing to recognize a piece of information than it is to recall it
it works well only so long as appropriate links are available
If the links are not available, then the browsing experience might be frustrating
Spool discusses an example of a user looking for a software driver for a particular laser printer
Say the user first clicks on printers, then laser printers, then the following sequence of
links:
HP laser printers
HP laser printers model 9750
software for HP laser printers model 9750
software drivers for HP laser printers model 9750
software drivers for HP laser printers model 9750 for the Win98 operating system
This kind of interaction is acceptable when each refinement makes sense for the task at
hand
1.13.4. Search Process
Numerous studies have been made of people engaged in the search process
The results of these studies can help guide the design of search interfaces
One common observation is that users often reformulate their queries with slight
modifications
Another is that searchers often search for information that they have previously accessed
The users’ search strategies differ when searching over previously seen
materials
Researchers have developed search interfaces that support both query history and revisitation
Studies also show that it is difficult for people to determine whether or not a document is
relevant to a topic
The less users know about a topic, the poorer judges they are of whether a search result
is relevant to that topic
Other studies found that searchers tend to look at only the top-ranked retrieved results
Further, they are biased towards thinking the top one or two results are better than those
beneath them
Studies also show that people are poor at estimating how much of the relevant material
they have found
Other studies have assessed the effects of knowledge of the search process itself

These studies have observed that experts use different strategies than novice searchers
For instance, Tabatabai et al found that
expert searchers were more patient than novices
this positive attitude led to better search outcomes
1.14. Search Interface Today
1.14.1. Getting Started
How does an information seeking session begin in online information systems?
The most common way is to use a Web search engine
Another method is to select a Web site from a personal collection of already-visited
sites
which are typically stored in a browser’s bookmarks
Online bookmark systems are popular among a smaller segment of users
Ex: Delicious.com
Web directories are also used as a common starting point, but have been largely
replaced by search engines
1.14.2. Query Specification
The primary methods for a searcher to express their information need are either
entering words into a search entry form
selecting links from a directory or other information organization display
For Web search engines, the query is specified in textual form
Typically, Web queries today are very short, consisting of one to three words
Short queries reflect the standard usage scenario in which the user tests the waters:
If the results do not look relevant, then the user reformulates their query
If the results are promising, then the user navigates to the most relevant-looking
Web site
This search behavior is a demonstration of the orienteering strategy of Web search
Before the Web, search systems regularly supported Boolean operators and command-based
syntax
However, these are often difficult for most users to understand
Jansen et al conducted a study over a Web log with 1.5M queries, and found that
2.1% of the queries contained Boolean operators
7.6% contained other query syntax, primarily double-quotation marks for phrases
White et al examined interaction logs of nearly 600,000 users, and found that
1.1% of the queries contained one or more operators
8.7% of the users used an operator at any time
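A log study of this kind can be approximated with a small script. The sketch below is illustrative only: the two regular expressions and the sample log are assumptions, not the definitions used by Jansen et al or White et al.

```python
import re

# Assumed operator definitions; real log studies specify these in more detail.
BOOLEAN_RE = re.compile(r"\b(AND|OR|NOT)\b")   # uppercase Boolean operators only
PHRASE_RE = re.compile(r'"[^"]+"')             # double-quoted phrases

def operator_usage(queries):
    """Return the fraction of queries using Boolean operators and quoted phrases."""
    total = len(queries)
    if total == 0:
        return {"boolean": 0.0, "phrase": 0.0}
    boolean = sum(1 for q in queries if BOOLEAN_RE.search(q))
    phrase = sum(1 for q in queries if PHRASE_RE.search(q))
    return {"boolean": boolean / total, "phrase": phrase / total}

# Made-up sample log for illustration.
sample_log = [
    "cheap flights",
    "python AND pandas",
    '"information retrieval" textbook',
    "weather tomorrow",
]
usage = operator_usage(sample_log)
print(usage)  # {'boolean': 0.25, 'phrase': 0.25}
```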

Web ranking has gone through three major phases
In the first phase, from approximately 1994–2000:
Since the Web was much smaller then, complex queries were less likely to yield
relevant information
Further, the pages retrieved did not necessarily contain all the query words
Around 1997, Google moved to conjunctive queries only
The other Web search engines followed, and conjunctive ranking became the norm
Google also added term proximity information and page importance scoring
(PageRank)
As the Web grew, longer queries posed as phrases started to
produce highly relevant results
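The difference between matching any query word and matching all of them (conjunctive queries) can be sketched with a toy inverted index. The documents and function names below are hypothetical, and ranking signals such as term proximity or PageRank are left out.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def disjunctive(index, query):
    """Documents containing at least one query term (OR semantics)."""
    result = set()
    for term in query.lower().split():
        result |= index.get(term, set())
    return result

def conjunctive(index, query):
    """Documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up collection for illustration.
docs = {
    1: "laser printer driver download",
    2: "laser pointer reviews",
    3: "printer ink refill",
}
index = build_index(docs)
print(disjunctive(index, "laser printer"))  # {1, 2, 3}
print(conjunctive(index, "laser printer"))  # {1}
```

With conjunctive matching, every added query word can only narrow the result set, which is one reason longer queries become more useful as the collection grows.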
1.14.3. Query Specification Interfaces
The standard interface for a textual query is a search box entry form
Studies suggest a relationship between query length and the width of the entry form
Results found that either small forms discourage long queries or wide forms
encourage longer queries
Some entry forms are followed by a form that filters the query in some way
For instance, at yelp.com, the user can refine the search by location using a second form
Notice that the yelp.com form also shows the user’s home location, if it has been specified
previously
Some search forms show hints on what kind of information should be entered into each
form
For instance, in zvents.com search, the first box is labeled “what are you looking for?”
The previous example also illustrates specialized input types that some search engines
are supporting today
The zvents.com site recognizes that words like “tomorrow” are time-sensitive
It also allows flexibility in the syntax of dates
To illustrate, searching for “comedy on wed” automatically computes the date for the
nearest future Wednesday
This is an example of how the interface can be designed to reflect how people think
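Resolving a word such as “wed” or “tomorrow” to a concrete date can be sketched as below. This is a minimal illustration, not how zvents.com implements it; the function name is hypothetical, and treating “nearest future” as strictly after today is an assumption.

```python
from datetime import date, timedelta

# Index in this list matches date.weekday(), where Monday is 0.
WEEKDAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
                 "friday", "saturday", "sunday"]

def resolve_date_word(word, today):
    """Turn 'today', 'tomorrow', or a weekday name/abbreviation into a date."""
    w = word.lower()
    if w == "today":
        return today
    if w == "tomorrow":
        return today + timedelta(days=1)
    for weekday, name in enumerate(WEEKDAY_NAMES):
        if w == name or w == name[:3]:
            # Assumption: a weekday equal to today means next week's occurrence.
            days_ahead = (weekday - today.weekday()) % 7 or 7
            return today + timedelta(days=days_ahead)
    return None  # not a recognized date word

monday = date(2024, 1, 1)  # 1 January 2024 was a Monday
print(resolve_date_word("wed", monday))       # 2024-01-03
print(resolve_date_word("tomorrow", monday))  # 2024-01-02
```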
Some interfaces show a list of query suggestions as the user types the query
This is referred to as auto-complete, auto-suggest, or dynamic query suggestions
Anick et al found that users clicked on dynamic Yahoo suggestions one third of
the time
Often the suggestions shown are those whose prefix matches the characters typed so far
However, in some cases, suggestions are shown that only have interior letters
matching
Further, suggestions may be shown that are synonyms of the words typed so far
Dynamic query suggestions, from Netflix.com
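The prefix and interior matching described above can be sketched in a few lines. The candidate list is made up and the function name is hypothetical; synonym-based suggestions would need a thesaurus and are not covered here.

```python
def suggest(typed, candidates, limit=5):
    """Rank prefix matches first, then interior (substring) matches."""
    needle = typed.lower()
    if not needle:
        return []
    prefix = [c for c in candidates if c.lower().startswith(needle)]
    interior = [c for c in candidates
                if needle in c.lower() and not c.lower().startswith(needle)]
    return (prefix + interior)[:limit]

# Made-up candidates, e.g. drawn from a user's query history.
history = ["star wars", "stargate", "the rising star", "mustard documentary"]
print(suggest("star", history))
# ['star wars', 'stargate', 'the rising star', 'mustard documentary']
```

Note that “mustard” matches only because its interior letters contain “star”, which shows why interior matches are usually ranked below prefix matches.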

The dynamic query suggestions can be derived from several sources, including:
The user’s own query history
A set of metadata that a Web site’s designer considers important
All of the text contained within a Web site
Dynamic query suggestions, grouped by type, from NextBio.com:
1.14.4. Retrieval Result Display
When displaying search results, either
the documents must be shown in full, or else
the searcher must be presented with some kind of representation of the content of
those documents
The document surrogate refers to the information that summarizes the document
This information is a key part of the success of the search interface
The design of document surrogates is an active area of research and
experimentation
The quality of the surrogate can greatly affect the perceived relevance of the
search results listing
In Web search, the page title is usually shown prominently, along with the URL and other
metadata
In search over information collections, metadata such as date published and author are
often displayed
A text summary (or snippet) containing text extracted from the document is also
critical
Currently, the standard results display is a vertical list of textual summaries
This list is sometimes referred to as the SERP (Search Engine Results Page)

In some cases the summaries are excerpts drawn from the full text that contain the query
terms
In other cases, specialized kinds of metadata are shown in addition to standard textual
results
This technique is known as blended results or universal search
For example, a query on a term like “rainbow” may return sample images as one entry in
the results listing
A query on the name of a sports team might retrieve the latest game scores and a link to buy tickets
Nielsen notes that in some cases the information need is satisfied directly in the search
results listing

This makes the search engine an “answer engine”
Displaying the query terms in the context in which they appear in the document:
Improves the user’s ability to gauge the relevance of the results
It is sometimes referred to as KWIC - keywords in context
It is also known as query-biased summaries, query-oriented summaries, or user-directed
summaries
The visual effect of query term highlighting can also improve usability of search results listings
Highlighting can be shown both in document surrogates in the retrieval results
and in the retrieved documents
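A keyword-in-context summary with highlighting can be sketched as follows. This is a minimal illustration under assumed choices: a fixed character window around each hit and `**` as the highlight marker; production snippet generators also weigh sentence boundaries and snippet length.

```python
import re

def kwic(text, term, width=30, mark="**"):
    """Return one snippet per occurrence of term, with surrounding context."""
    snippets = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippet = (text[start:match.start()]
                   + mark + match.group(0) + mark   # highlighted query term
                   + text[match.end():end])
        snippets.append(snippet.strip())
    return snippets

# Made-up document text.
text = "The printer driver must match the printer model."
for line in kwic(text, "printer", width=4):
    print(line)
# The **printer** dri
# the **printer** mod
```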
Determining which text to place in the summary, and how much text to show, is a
challenging problem
Often the summaries contain all the query terms in close proximity to one another
However, there is a trade-off between
Showing contiguous sentences, to aid in coherence in the result
Showing sentences that contain the query terms
Some results suggest that it is better to show full sentences rather than cut them off
On the other hand, very long sentences are usually not desirable in the results
listing
Further, the kind of information to display should vary according to the intent of the query
Longer results are deemed better than shorter ones for certain types of
information need
On the other hand, an abbreviated listing is preferable for navigational queries
Similarly, requests for factual information can be satisfied with a concise results
display

Other kinds of document information can be usefully shown in the search results page
The page results below show figures extracted from journal articles alongside the search
results
1.14.5. Query Reformulation
There are tools to help users reformulate their query
One technique consists of showing terms related to the query or to the documents
retrieved in response to the query
A special case of this is spelling corrections or suggestions

Usually only one suggested alternative is shown: clicking on that alternative re-executes
the query
In earlier years, the search results were shown using the purportedly incorrect
spelling
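A single suggested alternative can be produced with the standard library’s `difflib.get_close_matches`. The sketch below is only an illustration: the vocabulary is made up, the cutoff is an assumed value, and Web search engines typically rely on query logs rather than a fixed word list.

```python
import difflib

def suggest_spelling(query, vocabulary, cutoff=0.8):
    """Return a corrected query, or None when no correction is proposed."""
    corrected = []
    changed = False
    for word in query.lower().split():
        if word in vocabulary:
            corrected.append(word)
            continue
        # Best single match above the similarity cutoff, if any.
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            corrected.append(matches[0])
            changed = True
        else:
            corrected.append(word)
    return " ".join(corrected) if changed else None

# Made-up vocabulary for illustration.
VOCABULARY = ["information", "retrieval", "search", "engine", "interface"]
print(suggest_spelling("informaton retreival", VOCABULARY))
# information retrieval
```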
Microsoft Live’s search results page for the query “IMF”
Term expansion: search interfaces are increasingly employing related term suggestions
Log studies suggest that term suggestions are a somewhat heavily-used feature in Web
search
Jansen et al made a log study and found that 8% of queries were generated from term
suggestions
Anick et al found that 6% of users who were exposed to term suggestions chose to click on them
Some query term suggestions are based on the entire search session of the particular user
Others are based on behavior of other users who have issued the same or similar queries in the past
One strategy is to show similar queries by other users
Another is to extract terms from documents that have been
clicked on in the past by searchers who issued the same query
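The first strategy, suggesting terms from other users’ similar queries, can be sketched by counting which terms co-occur with the query’s terms in a query log. The log and the function name are hypothetical, and real systems apply far more filtering.

```python
from collections import Counter

def related_terms(query, query_log, top_n=3):
    """Suggest terms that co-occur with the query's terms in logged queries."""
    query_terms = set(query.lower().split())
    counts = Counter()
    for logged in query_log:
        logged_terms = set(logged.lower().split())
        if query_terms & logged_terms:            # shares at least one term
            counts.update(logged_terms - query_terms)
    return [term for term, _ in counts.most_common(top_n)]

# Made-up query log.
query_log = ["jaguar car", "jaguar car price", "jaguar animal", "used car"]
print(related_terms("jaguar", query_log, top_n=1))  # ['car']
```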
Relevance feedback is another method whose goal is to aid in query reformulation
The main idea is to have the user indicate which documents are relevant to their query
In some variations, users also indicate which terms extracted from those
documents are relevant
The system then computes a new query from this information and shows a new retrieval set
Nonetheless, this method has not been found to be successful from a usability perspective
Because of that, it does not appear in standard interfaces today
This stems from several factors:
People are not particularly good at judging document relevance, especially for topics with which
they are unfamiliar
The beneficial behavior of relevance feedback is inconsistent
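The step in which the system computes a new query from the documents marked relevant can be sketched with a Rocchio-style update. Rocchio is one classic formulation, named here as an assumption since the slides do not say which method is meant; the weights `alpha` and `beta` are conventional illustrative values, and only positive feedback is used.

```python
from collections import Counter

def rocchio(query_vec, relevant_docs, alpha=1.0, beta=0.75):
    """Positive-feedback Rocchio update over term-weight dictionaries."""
    new_query = Counter()
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    # Add the average of the relevant document vectors, scaled by beta.
    for doc in relevant_docs:
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant_docs)
    return dict(new_query)

# Made-up term weights for illustration.
query = {"jaguar": 1.0}
marked_relevant = [
    {"jaguar": 1.0, "car": 2.0},
    {"car": 2.0, "engine": 1.0},
]
print(rocchio(query, marked_relevant))
# {'jaguar': 1.375, 'car': 1.5, 'engine': 0.375}
```

The expanded query now favors “car” and “engine”, which also hints at why the behavior can be inconsistent: one poorly judged document shifts the whole query.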
1.14.6. Organizing Search results
Organizing results into meaningful groups can help users understand the results and decide
what to do next
Popular methods for grouping search results: category systems and clustering
Category system: meaningful labels organized in such a way as to reflect the concepts
relevant to a domain
Good category systems have the characteristics of being coherent and relatively
complete
Their structure is predictable and consistent across search results for an
information collection
The most commonly used category structures are flat, hierarchical, and faceted categories
Flat categories are simply lists of topics or subjects
They can be used for grouping, filtering (narrowing), and sorting sets of
documents in search interfaces
Most Web sites organize their information into general categories
Selecting that category narrows the set of information shown accordingly
Some experimental Web search engines automatically organize results into flat categories
Studies using this kind of design have received positive user responses (Dumais et al,
Kules et al)
However, it can be difficult to find the right subset of categories to use for the vast content of
the Web
Rather, category systems seem to work better for more focused information collections
In the early days of the Web, hierarchical directory systems such as Yahoo’s were popular
Hierarchy can also be effective in the presentation of search results over a book or
other small collection
The Superbook system was an early search interface based on this idea
In the Superbook system, the search results were shown in the context of the table-of-contents
hierarchy
An alternative representation is faceted metadata
Unlike flat categories, faceted metadata allow the assignment of multiple categories to a single
item

Each category corresponds to a different facet (dimension or feature type) of the
collection of items
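Faceted metadata can be sketched as a mapping from each facet to the set of values assigned to an item, which allows filtering on several facets at once. The recipe data and function names below are made up for illustration.

```python
from collections import Counter

def filter_by_facets(items, selections):
    """Keep items that carry every selected value in the corresponding facet."""
    return [
        item for item in items
        if all(value in item["facets"].get(facet, set())
               for facet, value in selections.items())
    ]

def facet_counts(items, facet):
    """Count how many items carry each value of one facet."""
    counts = Counter()
    for item in items:
        counts.update(item["facets"].get(facet, set()))
    return counts

# Each item has values in several facets (dimensions) at the same time.
recipes = [
    {"title": "Lasagna", "facets": {"cuisine": {"italian"}, "course": {"main"}}},
    {"title": "Tiramisu", "facets": {"cuisine": {"italian"}, "course": {"dessert"}}},
    {"title": "Mochi", "facets": {"cuisine": {"japanese"}, "course": {"dessert"}}},
]
hits = filter_by_facets(recipes, {"cuisine": "italian", "course": "dessert"})
print([item["title"] for item in hits])            # ['Tiramisu']
print(facet_counts(recipes, "course")["dessert"])  # 2
```

The per-value counts are what faceted interfaces commonly show next to each label, so users can see how a selection would narrow the results.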
Clustering refers to the grouping of items according to some measure of similarity
It groups together documents that are similar to one another but different from the rest of the
collection
For example, all the documents written in Japanese that appear in a collection of primarily
English articles would be grouped together
The greatest advantage of clustering is that it is fully automatable
The disadvantages of clustering include
an unpredictability in the form and quality of results
the difficulty of labeling the groups
the counter-intuitiveness of cluster sub-hierarchies

Output produced using Findex clustering
Cluster output on the query “senate”, from Clusty.com
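Grouping results by a similarity measure can be sketched with a simple single-pass procedure using Jaccard similarity over word sets. This is an assumed, deliberately simple technique, not the algorithm used by Findex or Clusty.com; the documents and the threshold are made up.

```python
def jaccard(a, b):
    """Similarity of two term sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_terms, member_ids)
    for doc_id, text in docs.items():
        terms = set(text.lower().split())
        for seed, members in clusters:
            if jaccard(terms, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((terms, [doc_id]))  # start a new cluster
    return [members for _, members in clusters]

# Made-up result titles for the query "senate".
results = {
    "a": "senate votes on budget bill",
    "b": "senate budget bill passes",
    "c": "roman senate history",
}
print(cluster(results))  # [['a', 'b'], ['c']]
```

The sketch runs without human input, but the groups depend on the threshold and on document order and carry no labels, which mirrors the disadvantages listed above.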
1.15. Visualization in Search Interfaces
Experimentation with visualization for search has been primarily applied in the following ways:
Visualizing Boolean syntax
Visualizing query terms within retrieval results
Visualizing relationships among words and documents
Visualization for text mining

1.15.1. Visualizing Boolean syntax
Boolean query syntax is difficult for most users and is rarely used in Web search
For many years, researchers have experimented with how to visualize Boolean query
specification
A common approach is to show Venn diagrams
A more flexible version of this idea was seen in the VQuery system, proposed by Steve Jones
The VQuery interface for Boolean query specification
1.15.2. Visualizing query terms within retrieval results
Understanding the role of the query terms within the retrieved docs can help relevance assessment
Experimental visualizations have been designed that make this role more explicit
In the TileBars interface, for instance, documents are shown as horizontal glyphs
The locations of the query term hits are marked along the glyph
The user is encouraged to break the query into its different facets, with one concept per
line
Then, the lines show the frequency of occurrence of query terms within each topic
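The data behind such a glyph can be sketched as a grid with one row per query facet and one cell per document segment, where each cell counts the term hits. Splitting the document into equal-length segments is an assumption made for brevity, and the function name is hypothetical.

```python
def tilebar(text, facets, num_segments=4):
    """Return one row per facet; each cell counts facet-term hits in one segment."""
    words = text.lower().split()
    size = max(1, -(-len(words) // num_segments))  # ceiling division
    rows = []
    for facet_terms in facets:
        wanted = {term.lower() for term in facet_terms}
        row = [0] * num_segments
        for position, word in enumerate(words):
            if word in wanted:
                row[min(position // size, num_segments - 1)] += 1
        rows.append(row)
    return rows

# Made-up document; the query has two facets, one concept per line.
doc = ("laser printer driver install guide for laser printer "
       "on windows then reboot windows")
grid = tilebar(doc, [["laser", "printer"], ["windows"]])
for row in grid:
    print(row)
# [2, 2, 0, 0]
# [0, 0, 1, 1]
```

Darker cells in the glyph would correspond to higher counts, so a reader can see at a glance that the two facets never occur in the same segment of this document.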

Ex: TileBars interface
Other approaches include placing the query terms in bar charts, scatter plots, and tables
A usability study by Reiterer et al compared five views:
a standard Web search engine-style results listing
a list view showing titles, document metadata, and a graphic showing locations
of query terms
a color TileBars-like view
a color bar chart view like that of Veerasamy & Belkin
a scatter plot view plotting relevance scores against date of publication
Field-sortable search results view

Colored TileBars view
When asked for subjective responses, the 40 participants of the study preferred, on
average, in this order:
Field-sortable view first
TileBars
Web-style listing
The bar chart and scatter plot received negative responses

Another variation on the idea of showing query term hits within documents is to show thumbnails
Thumbnails are miniaturized rendered versions of the visual appearance of the
document
However, Czerwinski et al found that thumbnails are no better than blank squares for
improving search results
The negative study results may stem from a problem with the size of the thumbnails
Woodruff et al showed that making the query terms more visible via
highlighting within the thumbnail improves its usability
Textually enhanced thumbnails
1.15.3. Visualizing relationships among words and documents
Numerous works proposed variations on the idea of placing words and docs on a two-dimensional
canvas
In these works, proximity of glyphs represents semantic relationships among the terms or
documents
An early version of this idea is the VIBE interface
Documents containing combinations of the query terms are placed midway
between the icons representing those terms
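The placement rule can be sketched as a weighted average of the term icons’ positions. Weighting by term frequency is an assumption added here; with equal weights the document lands exactly midway between the icons, as described above. The coordinates and function name are hypothetical.

```python
def vibe_position(doc_term_weights, anchors):
    """Place a document at the weighted average of its query terms' icon positions."""
    shared = {t: w for t, w in doc_term_weights.items() if t in anchors and w > 0}
    total = sum(shared.values())
    if total == 0:
        return None  # the document contains none of the query terms
    x = sum(anchors[t][0] * w for t, w in shared.items()) / total
    y = sum(anchors[t][1] * w for t, w in shared.items()) / total
    return (x, y)

# Hypothetical icon positions on the canvas.
anchors = {"laser": (0.0, 0.0), "printer": (10.0, 0.0)}
print(vibe_position({"laser": 1, "printer": 1}, anchors))  # (5.0, 0.0)
print(vibe_position({"laser": 3, "printer": 1}, anchors))  # (2.5, 0.0)
```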

The Aduna Autofocus and the Lyberworld projects presented a 3D version of
the ideas behind VIBE
The VIBE display
Another idea is to map docs or words from a very high-dimensional term space down into a 2D
plane
The display then shows where the docs or words fall within that plane, using 2D or 3D graphics
This variation on clustering can be applied to
documents retrieved as a result of a query
a pre-processed set of documents, within which the documents that match a query are
highlighted
InfoSky and xFIND’s VisIslands are two variations on these starfield displays
InfoSky, from Jonker et al

xFIND’s VisIslands, from Andrews et al
These views are relatively easy to compute and can be visually striking
However, evaluations that have been conducted so far provide negative evidence as to
their usefulness
The main problem is that the contents of the documents are not visible in
such views
A more promising application of this kind of idea is in the layout of thesaurus terms, in a
small network graph
Ex: Visual Wordnet
The Visual Wordnet view of the WordNet lexical thesaurus

1.15.4. Visualization for text mining
Visualization is also used for purposes of analysis and exploration of textual data
Visualizations such as the Word Tree show a piece of a text concordance
It allows the user to view which words and phrases commonly precede or
follow a given word
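The counts behind a Word Tree branch can be sketched by tallying which words follow a given word. The tokenizer and the sample text are assumptions for illustration; the sample is made up and is not a quotation from the speech shown in the figure.

```python
import re
from collections import Counter

def following_words(text, word, top_n=3):
    """Count the words that immediately follow `word` in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    target = word.lower()
    counts = Counter(nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == target)
    return counts.most_common(top_n)

# Made-up sample text.
sample = "I have a dream that one day. I have a dream today. I have hope."
print(following_words(sample, "have"))  # [('a', 2), ('hope', 1)]
```

Applying the same count recursively to each follower would yield the branching structure that the Word Tree draws.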
Another example is the NameVoyager, which shows frequencies of names for U.S. children across
time
The Word Tree visualization, on Martin Luther King’s “I have a dream” speech, from
Wattenberg et al
The popularity of baby names over time (names beginning with JA), from
babynamewizard.com

Visualization is also used in search interfaces intended for analysts
An example is the TRIST information triage system, from Proulx et al
In this system, search results are represented as document icons
Thousands of documents can be viewed in one display
It supports multiple linked dimensions that allow for finding characteristics and
correlations among the docs
Its designers won the IEEE Visual Analytics Science and Technology (VAST) contest for
two years running

The TRIST interface with results for queries related to Avian Flu