Natural Language Processing introduction-L1.pdf

BITS Pilani
Pilani Campus
AIMLCZG537/DSECLZG537
Information Retrieval
Dr. Maheswari Karthikeyan
Lecture 1: 25-05-2024

BITS Pilani, Pilani Campus
•To acquire a basic understanding of the components and the different IR methods.
•Boolean
•Vector Space
•To understand the various application areas of IR:
•Text Mining
•Web Search
•Cross-Lingual IR
•Multimedia IR
•Recommender System
•Neural IR
Course Outline
25/05/2024 INFORMATION RETRIEVAL; L1

1. C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/
2. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 2000. http://people.ischool.berkeley.edu/~hearst/irbook/
3. F. Ricci, L. Rokach, B. Shapira and P. B. Kantor (Eds.), Recommender Systems Handbook, 1st Edition, 2011, 845 p., 20 illus., Hardcover, ISBN: 978-0-387-85819-7
Books to Refer

Introduction
–Information Retrieval
–Information vs. Data Retrieval
–IR task
–Basic Concepts
–Logical view of the documents
–The retrieval process
–Classical IR models
Lecture Outline

Information Retrieval
[Figure: IR system diagram with labels: Information need, Query, Retrieval, Document collection, Answer List]
•To efficiently retrieve, from a large document set, the documents relevant to an information need

•IR: representation, storage, organization of, and access to information items
•Focus on the user information need
•Emphasis is on the retrieval of information (not data)
Motivation

• Search
• Filtering
• Organization
• Multiple languages
• Multiple media
Information Retrieval

•Retrospective
•“Searching the past”
•Different queries posed against a static collection
•Time invariant
•Prospective
•“Searching the future”
•Static query posed against a dynamic collection
•Time dependent
Types of Information Needs

•Input:
•A corpus of textual natural-language
documents
•A user query in the form of a textual string
•Output:
• A ranked set of documents that are
relevant to the query.
IR Task

[Figure: the IR System takes a Document Corpus and a Query String as input and outputs Ranked Documents]
IR Task

•Relevance is a subjective judgment and may
include:
•Being on the proper subject.
•Being timely (recent information).
•Being authoritative (from a trusted source).
•Satisfying the goals of the user and intended
use of the information (information need).
Relevance

•Takes into account the meaning of the words used
•Takes into account the order of words in the query
•Adapts to the user based on direct or indirect feedback
•Takes into account the authority of the source
Intelligent IR

•Data retrieval
•Which documents contain a set of keywords?
•Well-defined structure and semantics
•A single erroneous object implies failure
•Provides a solution to the user of a database system
•Information retrieval
•Information about a subject or topic
•Semantics is frequently loose
•Small errors are tolerated
•Deals with natural language text
IR vs. Data Retrieval

IR vs. Data Retrieval
Aspect | Data retrieval | IR
Data | Structured | Unstructured
Fields | Clear semantics (SSN, age) | No fields (other than text)
Queries | Defined (relational algebra, SQL) | Free text (“natural language”), Boolean
Matching | Exact (results are always “correct”) | Imprecise (need to measure effectiveness)

•The efficiency of a retrieval system is directly related to:
•User task
•Logical view of the documents
IR System - Basic Concepts

•The User Task
•Retrieval
•Information or data
•Purposeful
•Browsing
•Hypertext systems used
•Glancing around
•Both retrieval (ad hoc) and browsing are “pulling” actions
•The alternative is to “push” the information towards the user; the retrieval task then consists of “filtering” relevant information.
[Figure: interaction of the user with the retrieval system through distinct tasks (Retrieval and Browsing over the Database)]
User Task

•Documents in a collection are frequently represented through a set
of index terms or keywords
•Keywords are extracted from the document
•Keywords are derived automatically or generated by a specialist; they provide a logical view of the document
•Stop-words
•Removed to reduce the set of representative keywords in a large collection
•Function words do not bear useful information for IR,
•e.g. of, in, about, with, I, although, …
•Stop-list: contains the stop-words, which are not to be used as index terms
•Prepositions, Articles, Pronouns
•Some adverbs and adjectives, Some frequent words (e.g. document)
•The removal of stop-words usually improves IR effectiveness
•A few “standard” stop-lists are commonly used.
Logical view of the documents
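Stop-word removal as described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a standard stop-list: the `STOP_WORDS` set and the `index_terms` helper are hypothetical names introduced here, and the list only contains the function words named above plus a few articles.

```python
import re

# Illustrative stop-list (an assumption for this example);
# a real system would load one of the standard stop-lists instead.
STOP_WORDS = {"of", "in", "about", "with", "i", "although", "the", "a", "an", "is", "for"}

def index_terms(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(index_terms("Information about the retrieval of documents in a large collection"))
```

The remaining tokens are the candidate index terms that form the logical view of the document.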

Basic Concepts
Logical view of the document: from full text to a set of index terms

•Noun groups
•Identify the noun groups
•This eliminates the adjectives, adverbs and verbs
•Reason for stemming
•Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
•Stemming
•Reduces distinct words to their common grammatical root
•Works by removing some word endings
•Example: computer, compute, computes, computing, computed, computation → comput
Logical view of the documents
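The idea of reducing word forms to a common root can be illustrated with a toy suffix stripper. This is only a sketch under stated assumptions: the `SUFFIXES` list and the `naive_stem` function are made up for this example and are much cruder than a real stemmer such as Porter's algorithm.

```python
# Toy suffix list, longest first; an assumption made for this example only.
SUFFIXES = ("ation", "ing", "ed", "er", "es", "e", "s")

def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["computer", "compute", "computes", "computing", "computed", "computation"]
stems = {w: naive_stem(w) for w in words}
print(stems)
```

All six word forms from the slide map to the same stem, so a query using any one of them can match documents containing the others.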

The Retrieval Process
[Figure: architecture of the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Flow labels: user need, text, logical view, query, inverted file, retrieved docs, ranked docs, user feedback]

•Text database
•Defined by the DB manager, who specifies:
•The documents to be used
•The operations to be performed on the text
•The text model, i.e. the text structure and what elements can be used for retrieval
•Text operations transform the original documents and generate a logical view of them
•The database manager builds an index of the text, i.e. an “inverted file”
•Query operations generate the actual “query” based on the user need, to retrieve the relevant documents
•The retrieved documents are ranked and listed.
The Retrieval Process

•Text Operations form index words (tokens)
•Stop-word removal
•Stemming
•Indexing constructs an inverted index of word to document pointers
•Searching retrieves documents that contain a given query token from
the inverted index
•Ranking scores all retrieved documents according to a relevance metric
•User Interface manages interaction with the user:
•Query input and document output.
•Relevance feedback.
•Visualization of results.
•Query Operations transform the query to improve retrieval:
•Query expansion
•Query transformation using relevance feedback
IR System Components
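The indexing and searching components above can be sketched as follows. This is a minimal in-memory illustration, assuming documents that have already gone through text operations; the function names `build_inverted_index` and `search` and the sample data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map each index term to the set of document ids that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return dict(index)

def search(index: dict[str, set[int]], token: str) -> list[int]:
    """Return the sorted ids of documents containing the query token."""
    return sorted(index.get(token, set()))

# Hypothetical documents, already tokenized, stop-word-free, and stemmed.
docs = {
    1: ["inform", "retriev", "model"],
    2: ["boolean", "model"],
    3: ["vector", "space", "retriev"],
}
index = build_inverted_index(docs)
print(search(index, "retriev"))
```

A ranking component would then score the documents returned by `search` instead of listing them in id order.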

Information Retrieval Models

•Traditional IR uses index terms to retrieve documents
•A ranking is an ordering of the documents retrieved in response to the user query
•A ranking is based on fundamental premises regarding the notion of relevance, such as:
•common sets of index terms
•sharing of weighted terms
•likelihood of relevance
•Each set of premises leads to a distinct IR model
Information Retrieval Models

Modeling
[Figure: Docs and the Information Need are represented through Index Terms as doc and query, which are matched to produce a Ranking]

•User task: Retrieval (ad hoc, filtering)
•Classic Models: Boolean, vector, probabilistic
•Set Theoretic: Fuzzy, Extended Boolean
•Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
•Probabilistic: Inference Network, Belief Network
•Structured Models: Non-Overlapping Lists, Proximal Nodes
•User task: Browsing
•Browsing: Flat, Structure Guided, Hypertext
Taxonomy of IR Models

•The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system
User task | Logical view: Index Terms | Logical view: Full Text | Logical view: Full Text + Structure
Retrieval | Classic, Set Theoretic, Algebraic, Probabilistic | Classic, Set Theoretic, Algebraic, Probabilistic | Structured
Browsing | Flat | Flat, Hypertext | Structure Guided, Hypertext
Taxonomy of IR Models

Retrieval - Ad hoc
[Figure: queries Q1 to Q5 posed against a collection of “fixed size”]

[Figure: a documents stream is matched against the User 1 and User 2 profiles, producing the docs for User 1 and the docs filtered for User 2]
Retrieval - Filtering

•Basic concepts
•A set of representative keywords called index terms
•Consider all distinct words as index terms
•Index terms are mainly nouns
•Search engines assume that all words are index terms
•Properties of index terms – useful and less useful
•Not all terms are equally useful for representing the document contents
•The importance of the index terms is represented by weights associated with them
Classic Information Retrieval

Definition
•$k_i$ is an index term
•$d_j$ is a document
•$t$ is the total number of index terms
•$K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
•$w_{ij} \ge 0$ is a weight associated with $(k_i, d_j)$
•$w_{ij} = 0$ indicates that the term does not belong to the doc
•$\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$ is a weighted vector associated with the document $d_j$
•$g_i(\vec{d}_j) = w_{ij}$ is a function which returns the weight associated with the pair $(k_i, d_j)$
Classic Information Retrieval
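One way to picture these definitions is as a small table of weights. The sketch below is illustrative only: the vocabulary `K`, the weight values, and the helper `g` are assumptions chosen for the example, not values from the lecture.

```python
# Hypothetical vocabulary K = {k_1, ..., k_t} with t = 3 index terms.
K = ["retrieval", "boolean", "vector"]

# One weight vector per document: vec(d_j) = (w_1j, w_2j, w_3j).
# The weight values are made up for illustration.
weights = {
    "d1": [0.8, 0.0, 0.3],
    "d2": [0.0, 0.5, 0.0],
}

def g(i: int, doc_vector: list[float]) -> float:
    """Return w_ij, the weight of index term k_i (1-based i) in the given document vector."""
    return doc_vector[i - 1]

t = len(K)
print(g(1, weights["d1"]))         # weight of k_1 in d_1
print(g(2, weights["d1"]) == 0.0)  # k_2 does not belong to d_1
```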

•Boolean model
•Vector Space model
•Probabilistic model
Classical IR Models

•Simple model based on set theory and Boolean algebra
•Documents are sets of terms
•Queries are Boolean expressions on terms
•Historically the most common model
•Library OPACs
•Dialog system
•Many web search engines
•Queries specified as Boolean expressions
•Precise semantics
•Neat formalism
•Terms are either present or absent. Thus, $w_{ij} \in \{0, 1\}$
•There are three connectives used: and, or, not
Boolean Model

•D: set of words (indexing terms) present in a document
•each term is either present (1) or absent (0)
•Q: A Boolean expression
•terms are index terms
•operators are AND, OR, and NOT
•F: Boolean algebra over sets of terms and sets of documents
•R: a document is predicted as relevant to a query expression if
it satisfies the query expression
•Example: ((text ∨ information) ∧ retrieval ∧ ¬theory)
•Each query term specifies a set of documents containing the term
•AND (∧): the intersection of two sets
•OR (∨): the union of two sets
•NOT (¬): set inverse, or really set difference
Boolean Model
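The set interpretation of the three operators maps directly onto Python set operations. The sketch below assumes a tiny hypothetical collection and evaluates a query of the form (text OR information) AND retrieval AND NOT theory; the `docs` data and the `postings` helper are illustrative names.

```python
# Hypothetical collection: each document is the set of index terms it contains.
docs = {
    1: {"text", "retrieval"},
    2: {"information", "retrieval", "theory"},
    3: {"information", "retrieval"},
    4: {"text", "theory"},
}
all_docs = set(docs)

def postings(term: str) -> set[int]:
    """Set of ids of the documents that contain the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

# (text OR information) AND retrieval AND NOT theory
result = (
    (postings("text") | postings("information"))  # OR: union
    & postings("retrieval")                       # AND: intersection
    & (all_docs - postings("theory"))             # NOT: difference from the whole collection
)
print(sorted(result))
```

NOT is implemented as a difference against the full collection, which is why it is usually combined with AND rather than used alone.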

Definition
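–Index term weight variables are all binary
–$w_{ij} \in \{0, 1\}$
–Query $q = k_a \wedge (k_b \vee \neg k_c)$
–$sim(q, d_j) = 1$ if the document satisfies $q$, i.e. the doc is relevant; $sim(q, d_j) = 0$ otherwise, i.e. the doc is not relevant
–In disjunctive normal form over $(k_a, k_b, k_c)$: $(1,1,1) \vee (1,1,0) \vee (1,0,0)$
[Figure: Venn diagram of $K_a$, $K_b$, $K_c$ marking the regions (1,1,1), (1,1,0) and (1,0,0)]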
Boolean Model
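The three binary vectors on this slide can be derived mechanically. The sketch below assumes the query is q = k_a AND (k_b OR NOT k_c), which is consistent with the vectors (1,1,1), (1,1,0) and (1,0,0) shown; the function names `q` and `sim` are illustrative.

```python
from itertools import product

def q(ka: int, kb: int, kc: int) -> bool:
    """The assumed query q = ka AND (kb OR NOT kc) over binary term weights."""
    return bool(ka and (kb or not kc))

# Enumerate all binary weight vectors (ka, kb, kc) that satisfy q:
# these are the conjunctive components of q in disjunctive normal form.
dnf = [bits for bits in product((1, 0), repeat=3) if q(*bits)]
print(dnf)

def sim(doc_vector: tuple[int, int, int]) -> int:
    """Return 1 if the document's binary vector matches a component of q, else 0."""
    return 1 if doc_vector in dnf else 0

print(sim((1, 0, 0)), sim((0, 1, 1)))
```

Because `sim` only returns 0 or 1, every matching document is treated as equally relevant, which is the lack of ranking discussed under the limitations of the model.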

•Advantages
•Clean Formalism
•Easy to implement
•Intuitive concept
•Still, it is a dominant model for document database
systems.
Boolean Model

•Retrieval based on binary decision criteria with no notion of
partial matching
•No ranking of the documents is provided (absence of a
grading scale)
•Information need has to be translated into a Boolean
expression which most users find difficult
•The Boolean queries formulated by the users are most often
too simplistic
•Frequently returns either too few or too many documents in
response to a user query.
Limitations of Boolean Model

•Use of binary weights is too limiting
•Non-binary weights provide consideration for partial matches
•These term weights are used to compute a degree of similarity between a query and each document
•Ranked set of documents provides for better matching
Define:
–$w_{i,j} \ge 0$ associated with the pair $(k_i, d_j)$
–$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
–$w_{i,q} \ge 0$ associated with the pair $(k_i, q)$
–$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
–$t$: total no. of index terms in the collection
Vector Model

$$sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \, |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
[Figure: document vector $\vec{d}_j$ and query vector $\vec{q}$ in term space]
•A document is retrieved even if it matches the query terms only partially
A good weight must take two effects into account:
•quantification of intra-document contents (similarity)
–tf factor, the term frequency within a document
•quantification of inter-document separation (dissimilarity)
–idf factor, the inverse document frequency
•$w_{i,j} = tf_{i,j} \times idf_i$
Vector Model
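The tf-idf weighting and cosine ranking can be sketched as below. The slides do not fix the exact tf and idf variants, so this example assumes raw term counts for tf and $idf_i = \log(N / n_i)$, one common choice; the sample documents, the query, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents and query (illustrative data only).
docs = {
    "d1": ["boolean", "model", "retrieval"],
    "d2": ["vector", "model", "retrieval", "retrieval"],
    "d3": ["probabilistic", "model"],
}
query = ["vector", "retrieval"]

vocab = sorted({term for tokens in docs.values() for term in tokens})
N = len(docs)
# Document frequency n_i: number of documents containing term k_i.
df = {term: sum(term in tokens for tokens in docs.values()) for term in vocab}

def tfidf_vector(tokens: list[str]) -> list[float]:
    """Weights w_i = tf_i * idf_i with raw tf and idf_i = log(N / n_i) (assumed variant)."""
    tf = Counter(tokens)
    return [tf[term] * math.log(N / df[term]) for term in vocab]

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

q_vec = tfidf_vector(query)
ranking = sorted(docs, key=lambda d: cosine(tfidf_vector(docs[d]), q_vec), reverse=True)
print(ranking)
```

In this sample, "model" occurs in every document, so its idf is zero and it contributes nothing to any score. Document d1 matches only one of the two query terms and still receives a non-zero score, which illustrates partial matching.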

•Advantages
•Simple model based on linear algebra
•Term weights not binary
•Allows computing a continuous degree of similarity between
queries and documents
•Allows ranking documents according to their possible relevance
•Allows partial matching
•Allows efficient implementation for large document collections
•Disadvantages
•Index terms are assumed to be mutually independent
•Search keywords must precisely match document terms
•Long documents are poorly represented
•The order in which the terms appear in the document is lost in the
vector space representation
•Weighting is intuitive, but not very formal.
Vector Model

•The model is called BIR (Binary Independence Retrieval)
•It uses a probabilistic framework
•Given a user query, there is an ideal answer set
•Guess at the beginning what its properties could be (i.e., guess an initial description of the ideal answer set)
•The user inspects the retrieved docs and judges each as either relevant or non-relevant
•Improve by iteration.
Probabilistic model

•An initial set of documents is retrieved; this can be done using the vector model or the Boolean model
•User inspects these docs looking for the relevant ones
•IR system uses this information to refine description of
ideal answer set
•By repeating this process, it is expected that the
description of the ideal answer set will improve
•Description of ideal answer set is modelled in probabilistic
terms.
Probabilistic model

•Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant)
• The model assumes that this probability of relevance
depends on the query and the document representations
only
• Ideal answer set is referred to as R and should maximize
the probability of relevance. Documents in the set R are
predicted to be relevant.

Probabilistic model - Ranking
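The slides do not give the ranking formula itself. The sketch below assumes the usual textbook BIR formulation, in which each query term present in a document contributes $\log\frac{p_i}{1-p_i} + \log\frac{1-r_i}{r_i}$, where $p_i$ is the probability that the term occurs in a relevant document and $r_i$ the probability that it occurs in a non-relevant one. The initial guess ($p_i = 0.5$, $r_i = n_i/N$), the smoothed update after feedback, the sample data, and all names are assumptions for illustration.

```python
import math

# Hypothetical collection: each document is its set of index terms (binary weights).
docs = {
    "d1": {"probabilistic", "model", "retrieval"},
    "d2": {"boolean", "model"},
    "d3": {"probabilistic", "ranking"},
    "d4": {"vector", "model", "ranking"},
    "d5": {"boolean", "retrieval"},
    "d6": {"vector", "space"},
}
query = {"probabilistic", "ranking"}
N = len(docs)

def term_weight(p: float, r: float) -> float:
    """Assumed BIR term weight; requires 0 < p < 1 and 0 < r < 1."""
    return math.log(p / (1 - p)) + math.log((1 - r) / r)

def rank(p: dict[str, float], r: dict[str, float]) -> list[str]:
    """Order documents by the summed weights of the query terms they contain."""
    def score(terms: set[str]) -> float:
        return sum(term_weight(p[k], r[k]) for k in query & terms)
    return sorted(docs, key=lambda d: score(docs[d]), reverse=True)

# Initial guess: p_i = 0.5 for every query term, r_i = n_i / N.
n = {k: sum(k in terms for terms in docs.values()) for k in query}
p0 = {k: 0.5 for k in query}
r0 = {k: n[k] / N for k in query}
print(rank(p0, r0))

# One feedback iteration: suppose the user marks d3 as relevant (the set V).
V = {"d3"}
v = {k: sum(k in docs[d] for d in V) for k in query}
p1 = {k: (v[k] + 0.5) / (len(V) + 1) for k in query}
r1 = {k: (n[k] - v[k] + 0.5) / (N - len(V) + 1) for k in query}
print(rank(p1, r1))
```

Repeating the feedback step with the newly judged documents is the iteration that refines the description of the ideal answer set R.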

•Advantages
•Documents are ranked in decreasing order of their probability of relevance
•Disadvantages
•Need to guess the initial separation of documents into
relevant and non-relevant sets
•All weights are binary
•The adoption of the independence assumption for
index terms.
Probabilistic model
