Natural Language Processing introduction-L1.pdf

ratnababum 28 views 44 slides Sep 13, 2024


BITS Pilani
Pilani Campus
AIMLCZG537/DSECLZG537
Information Retrieval
Dr. Maheswari Karthikeyan
Lecture 1: 25-05-2024

BITS Pilani, Pilani Campus
Course Outline
•To acquire a basic understanding of the components and the different IR methods:
–Boolean
–Vector Space
•To understand the various application areas of IR:
–Text Mining
–Web Search
–Cross-Lingual IR
–Multimedia IR
–Recommender Systems
–Neural IR
25/05/2024 INFORMATION RETRIEVAL; L1

Books to Refer
1. C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/
2. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 2000. http://people.ischool.berkeley.edu/~hearst/irbook/
3. F. Ricci, L. Rokach, B. Shapira and P. B. Kantor (Eds.), Recommender Systems Handbook, 1st Edition, 2011, 845 p., 20 illus., Hardcover, ISBN: 978-0-387-85819-7

Lecture Outline
Introduction
–Information Retrieval
–Information vs. Data Retrieval
–IR task
–Basic Concepts
–Logical view of the documents
–The retrieval process
–Classical IR models

Information Retrieval
[Figure: an information need is expressed as a query; the IR system performs retrieval over the document collection and returns an answer list]
•To efficiently retrieve documents relevant to an information need from a large document set

Motivation
•IR: representation, storage, organization of, and access to information items
•Focus on the user information need
•Emphasis is on the retrieval of information (not data)

Information Retrieval
•Search
•Filtering
•Organization
•Multiple languages
•Multiple media

Types of Information Needs
•Retrospective
–“Searching the past”
–Different queries posed against a static collection
–Time invariant
•Prospective
–“Searching the future”
–Static query posed against a dynamic collection
–Time dependent

IR Task
•Input:
–A corpus of textual natural-language documents
–A user query in the form of a textual string
•Output:
–A ranked set of documents that are relevant to the query

IR Task
[Figure: a document corpus and a query string are the inputs to the IR system, which outputs ranked documents]

Relevance
•Relevance is a subjective judgment and may include:
–Being on the proper subject.
–Being timely (recent information).
–Being authoritative (from a trusted source).
–Satisfying the goals of the user and the intended use of the information (information need).

Intelligent IR
•Intelligent IR takes into account:
–The meaning of the words used
–The order of words in the query
–Direct or indirect feedback from the user
–The authority of the source

IR vs. Data Retrieval
•Data retrieval
–Which documents contain a set of keywords?
–Well-defined structure and semantics
–A single erroneous object implies failure
–Provides a solution to the user of a database system
•Information retrieval
–Information about a subject or topic
–Semantics is frequently loose
–Small errors are tolerated
–Deals with natural-language text

IR vs. Data Retrieval
            Data retrieval                           IR
Data        Structured                               Unstructured
Fields      Clear semantics (SSN, age)               No fields (other than text)
Queries     Defined (relational algebra, SQL)        Free text (“natural language”), Boolean
Matching    Exact (results are always “correct”)     Imprecise (need to measure effectiveness)

IR System - Basic Concepts
•The efficiency of a retrieval system is directly related to:
–The user task
–The logical view of the documents

User Task
•Retrieval
–Information or data
–Purposeful
•Browsing
–Hypertext systems used
–Glancing around
•Both retrieval (ad hoc) and browsing are “pulling” actions
•The alternative is to “push” the information towards the user, i.e. to execute the particular retrieval task which consists of “filtering” relevant information
[Figure: interaction of the user with the retrieval system through distinct tasks (retrieval and browsing) over a database]

Logical view of the documents
•Documents in a collection are frequently represented through a set of index terms or keywords
•Keywords are extracted from the document
•Keywords are derived automatically or generated by a specialist; they provide a logical view of the document
•Stop-words
–Removed to reduce the set of representative keywords for a large collection
–Function words do not bear useful information for IR, e.g. of, in, about, with, I, although, …
–Stop-list: contains the stop-words, which are not to be used as index terms
–Prepositions, articles, pronouns
–Some adverbs and adjectives, some frequent words (e.g. document)
•The removal of stop-words usually improves IR effectiveness
•A few “standard” stop-lists are commonly used.
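
The stop-word step above can be sketched in a few lines of Python. This is only an illustration: the stop-list `STOP_WORDS` and the helper `index_terms` are hypothetical names, and the list is far shorter than any standard stop-list.

```python
import re

# Hypothetical, deliberately tiny stop-list (an assumption for illustration);
# real systems rely on longer "standard" stop-lists.
STOP_WORDS = {"of", "in", "about", "with", "i", "although",
              "the", "a", "an", "and", "is", "to"}

def index_terms(text: str) -> list[str]:
    """Lowercase the text, split it into word tokens, and drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

terms = index_terms("Although I read about the retrieval of information in a library")
print(terms)
```

Only the content-bearing words survive, which is the reduced set of keywords the slide refers to.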

Basic Concepts
[Figure: logical view of the document, from full text to a set of index terms]

Logical view of the documents
•Noun groups
–Identify the noun groups in the text
–This eliminates the adjectives, adverbs and verbs
•Reason for stemming
–Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
•Stemming
–Reduces distinct words to their common grammatical root
–Removes some word endings
•Example: computer, compute, computes, computing, computed, computation → comput
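
A toy suffix-stripping stemmer is enough to reproduce the "comput" example. The suffix list and the function `toy_stem` below are assumptions made for this sketch; it is not the Porter algorithm or any other published stemmer.

```python
# Toy suffix list, tried in this order (longer suffixes first).
# An assumption for illustration only, not a published stemming algorithm.
SUFFIXES = ("ation", "ing", "ed", "es", "er", "e", "s")

def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["computer", "compute", "computes", "computing", "computed", "computation"]
stems = {w: toy_stem(w) for w in words}
print(stems)
```

All six word forms map to the same stem, so a query using any one of them can match documents containing the others.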

The Retrieval Process
[Figure: the retrieval process. Components: User Interface, Text Operations, Query Operations, Indexing, Searching, Ranking, Index, DB Manager Module, Text Database. Labelled flows: user need, text, logical view, query, inverted file, retrieved docs, ranked docs, user feedback]

The Retrieval Process
•Text database: the DB manager specifies
–The documents to be used
–The operations to be performed on the text
–The text model, i.e. the text structure and which elements can be used for retrieval
•Text operations transform the original documents and generate a logical view of them
•The database manager builds an index of the text, i.e. an “inverted file”
•Query operations generate the actual “query” based on the user need, to retrieve the relevant documents
•The retrieved documents are ranked and listed.

IR System Components
•Text Operations form index words (tokens)
–Stop-word removal
–Stemming
•Indexing constructs an inverted index of word-to-document pointers
•Searching retrieves documents that contain a given query token from the inverted index
•Ranking scores all retrieved documents according to a relevance metric
•User Interface manages interaction with the user:
–Query input and document output
–Relevance feedback
–Visualization of results
•Query Operations transform the query to improve retrieval:
–Query expansion
–Query transformation using relevance feedback
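
The indexing, searching and ranking components can be sketched together as follows. The documents are assumed to have already passed through text operations, and the "relevance metric" here is simply the number of matching query terms, chosen for brevity rather than taken from the lecture. All function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, set[int]]:
    """Indexing: map each index term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)

def search(index: dict[str, set[int]], term: str) -> set[int]:
    """Searching: look up the documents that contain one query token."""
    return index.get(term, set())

def rank_by_overlap(index: dict[str, set[int]], query_terms: list[str]) -> list[int]:
    """Ranking: order documents by how many query terms they contain (toy metric)."""
    scores = defaultdict(int)
    for term in query_terms:
        for doc_id in search(index, term):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

# Documents after (assumed) stop-word removal and stemming.
docs = {
    1: ["inform", "retriev", "model"],
    2: ["boolean", "model"],
    3: ["vector", "space", "model", "retriev"],
}
index = build_inverted_index(docs)
ranked = rank_by_overlap(index, ["retriev", "model"])
print(ranked)
```

Documents 1 and 3 contain both query terms and are listed before document 2, which contains only one.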

Information Retrieval Models

Information Retrieval Models
•Traditional IR uses index terms to retrieve documents
•A ranking is an ordering of the documents retrieved in response to the user query
•A ranking is based on fundamental premises regarding the notion of relevance, such as:
–common sets of index terms
–sharing of weighted terms
–likelihood of relevance
•Each set of premises leads to a distinct IR model

Modeling
[Figure: docs and the information need are both represented through index terms (as doc and query); the two representations are matched to produce a ranking]

Taxonomy of IR Models
•User task: Retrieval (ad hoc, filtering)
–Classic Models: Boolean, vector, probabilistic
–Set Theoretic: Fuzzy, Extended Boolean
–Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
–Probabilistic: Inference Network, Belief Network
–Structured Models: Non-Overlapping Lists, Proximal Nodes
•User task: Browsing
–Flat
–Structure Guided
–Hypertext

Taxonomy of IR Models
•The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system

                                          LOGICAL VIEW OF DOCUMENTS
USER TASK    Index Terms                     Full Text                       Full Text + Structure
Retrieval    Classic, Set Theoretic,         Classic, Set Theoretic,         Structured
             Algebraic, Probabilistic        Algebraic, Probabilistic
Browsing     Flat                            Flat, Hypertext                 Structure Guided, Hypertext

Retrieval - Ad hoc
[Figure: queries Q1 to Q5 posed against a “fixed size” collection]

Retrieval - Filtering
[Figure: a documents stream is matched against the User 1 profile and the User 2 profile, producing the docs for User 1 and the docs filtered for User 2]

Classic Information Retrieval
•Basic concepts
–Documents are described by a set of representative keywords called index terms
–All distinct words can be considered as index terms
–Index terms are mainly nouns
–Search engines assume that all words are index terms
•Properties of index terms: useful and less useful
–Not all terms are equally useful for representing the document contents
–The importance of an index term is represented by a weight associated with it

Classic Information Retrieval
Definition
•$k_i$ is an index term
•$d_j$ is a document
•$t$ is the total number of index terms
•$K = \{k_1, k_2, \ldots, k_t\}$ is the set of all index terms
•$w_{i,j} \ge 0$ is a weight associated with the pair $(k_i, d_j)$
•$w_{i,j} = 0$ indicates that the term does not belong to the document
•$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$ is a weighted vector associated with the document $d_j$
•$g_i(\vec{d_j}) = w_{i,j}$ is a function which returns the weight associated with the pair $(k_i, d_j)$

Classical IR Models
•Boolean model
•Vector Space model
•Probabilistic model

Boolean Model
•Simple model based on set theory and Boolean algebra
•Documents are sets of terms
•Queries are Boolean expressions on terms
•Historically the most common model
–Library OPACs
–Dialog system
–Many web search engines
•Queries specified as Boolean expressions
–Precise semantics
–Neat formalism
•Terms are either present or absent. Thus, $w_{i,j} \in \{0, 1\}$
•There are three connectives used: and, or, not

Boolean Model
•D: set of words (indexing terms) present in a document
–each term is either present (1) or absent (0)
•Q: a Boolean expression
–terms are index terms
–operators are AND, OR, and NOT
•F: Boolean algebra over sets of terms and sets of documents
•R: a document is predicted as relevant to a query expression if it satisfies the query expression
–Example: ((text ∨ information) ∧ retrieval ∧ ¬theory)
•Each query term specifies a set of documents containing the term
–AND (∧): the intersection of two sets
–OR (∨): the union of two sets
–NOT (¬): set inverse, or really set difference
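
The set interpretation of the three operators maps directly onto Python's set operations. The postings below are invented for illustration, and the query (text OR information) AND retrieval AND NOT theory is used only as a sample expression.

```python
# Hypothetical postings: index term -> ids of the documents containing it.
postings = {
    "text": {1, 2},
    "information": {2, 3, 4},
    "retrieval": {2, 3, 5},
    "theory": {3},
}
all_docs = {1, 2, 3, 4, 5}

def docs_with(term: str) -> set[int]:
    """Each query term specifies the set of documents containing it."""
    return postings.get(term, set())

# OR is set union, AND is set intersection.
text_or_information = docs_with("text") | docs_with("information")
and_retrieval = text_or_information & docs_with("retrieval")

# NOT as set difference: remove the documents that contain "theory".
result = and_retrieval - docs_with("theory")

# Equivalent form: NOT as the complement with respect to the whole collection.
result_via_complement = and_retrieval & (all_docs - docs_with("theory"))

print(result)
```

Computing NOT as a difference avoids materialising the complement, which for a real collection would be nearly every document.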

Boolean Model
Definition
–Index term weight variables are all binary: $w_{i,j} \in \{0, 1\}$
–Query $q = k_a \wedge (k_b \vee \neg k_c)$
–$\mathrm{sim}(d_j, q) = 1$ if the document satisfies the query, i.e. the document is relevant; $\mathrm{sim}(d_j, q) = 0$ otherwise, i.e. the document is not relevant
[Figure: Venn diagram of $K_a$, $K_b$ and $K_c$ showing the regions (1,1,1), (1,1,0) and (1,0,0) that satisfy $q$]
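
The three regions in the diagram can be recovered by enumerating every binary combination of (k_a, k_b, k_c) and keeping those that satisfy q = k_a AND (k_b OR NOT k_c). This is a minimal sketch; the helper names `q` and `sim` are chosen here for illustration, and the document vector is restricted to the three query terms.

```python
from itertools import product

def q(ka: int, kb: int, kc: int) -> bool:
    """The query k_a AND (k_b OR NOT k_c) over binary term weights."""
    return bool(ka and (kb or not kc))

# Every (k_a, k_b, k_c) combination that satisfies the query.
satisfying = {bits for bits in product((0, 1), repeat=3) if q(*bits)}

def sim(doc_vector: tuple[int, int, int]) -> int:
    """Return 1 if the document's binary weights satisfy the query, else 0."""
    return 1 if tuple(doc_vector) in satisfying else 0

print(sorted(satisfying))
```

A document is either in one of these regions or it is not, which is why the Boolean model gives no partial matches.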

Boolean Model
•Advantages
–Clean formalism
–Easy to implement
–Intuitive concept
–It is still a dominant model for document database systems.

Limitations of Boolean Model
•Retrieval is based on binary decision criteria with no notion of partial matching
•No ranking of the documents is provided (absence of a grading scale)
•The information need has to be translated into a Boolean expression, which most users find difficult
•The Boolean queries formulated by users are most often too simplistic
•Frequently returns either too few or too many documents in response to a user query.

Vector Model
•Use of binary weights is too limiting
•Non-binary weights provide consideration for partial matches
•These term weights are used to compute a degree of similarity between a query and each document
•A ranked set of documents provides for better matching
Define:
–$w_{i,j} \ge 0$ is the weight associated with the pair $(k_i, d_j)$
–$\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$
–$w_{i,q} \ge 0$ is the weight associated with the pair $(k_i, q)$
–$\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
–$t$ is the total number of index terms in the collection

Vector Model
[Figure: document vector $\vec{d_j}$ and query vector $\vec{q}$ plotted against term axes $i$ and $j$]
$$\mathrm{sim}(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$
•A document is retrieved even if it matches the query terms only partially
A good weight must take two effects into account:
•quantification of intra-document contents (similarity)
–tf factor, the term frequency within a document
•quantification of inter-document separation (dissimilarity)
–idf factor, the inverse document frequency
•$w_{i,j} = \mathrm{tf} \times \mathrm{idf}$
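
A small worked example of tf-idf weighting and cosine similarity follows. The slide only states that the weight is tf times idf; the specific choices below (raw term counts for tf, and idf = log(N / n_i) with N documents and n_i documents containing the term) are common conventions assumed for this sketch, and the function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs: list[list[str]]):
    """Build one tf-idf vector per document over the sorted vocabulary."""
    n_docs = len(docs)
    vocab = sorted({term for terms in docs for term in terms})
    df = {t: sum(1 for terms in docs if t in terms) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}  # assumed idf form
    vectors = []
    for terms in docs:
        tf = Counter(terms)  # raw term frequency (assumed tf form)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, idf, vectors

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    ["boolean", "model", "retrieval"],
    ["vector", "model", "retrieval", "retrieval"],
    ["probabilistic", "model"],
]
vocab, idf, doc_vectors = tf_idf_vectors(docs)

query = ["vector", "retrieval"]
q_tf = Counter(query)
q_vec = [q_tf[t] * idf[t] for t in vocab]

scores = [cosine(d, q_vec) for d in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda j: -scores[j])
print(ranking, scores)
```

The first document matches only one of the two query terms and still receives a non-zero score, which is the partial matching the Boolean model lacks. The term "model" occurs in every document, so its idf is zero and it contributes nothing to any score.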

Vector Model
•Advantages
–Simple model based on linear algebra
–Term weights are not binary
–Allows computing a continuous degree of similarity between queries and documents
–Allows ranking documents according to their possible relevance
–Allows partial matching
–Allows efficient implementation for large document collections
•Disadvantages
–Index terms are assumed to be mutually independent
–Search keywords must precisely match document terms
–Long documents are poorly represented
–The order in which the terms appear in the document is lost in the vector space representation
–Weighting is intuitive, but not very formal.

Probabilistic model
•The model is called BIR (Binary Independence Retrieval)
•It uses a probabilistic framework
•Given a user query, there is an ideal answer set
•Guess at the beginning what its properties could be (i.e., guess an initial description of the ideal answer set)
•The user looks at the retrieved docs and judges them as either relevant or non-relevant
•Improve by iteration.

Probabilistic model
•An initial set of documents is retrieved; this can be done using the vector model or the Boolean model
•The user inspects these docs looking for the relevant ones
•The IR system uses this information to refine the description of the ideal answer set
•By repeating this process, it is expected that the description of the ideal answer set will improve
•The description of the ideal answer set is modelled in probabilistic terms.
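
One round of this guess-and-refine loop can be sketched as below. The slides give no formulas, so everything numeric here is an assumption: the term weight log(p / (1 - p)) + log((1 - r) / r), with p = P(term | relevant) and r = P(term | non-relevant), follows the usual textbook treatment of BIR, and the 0.5 smoothing constants are one common choice. The collection, the query and the user's judgments are invented.

```python
import math

# Invented collection: doc id -> set of index terms.
docs = {
    1: {"information", "retrieval", "model"},
    2: {"retrieval", "system"},
    3: {"information", "theory"},
    4: {"database", "system"},
    5: {"database", "model"},
    6: {"retrieval", "evaluation"},
    7: {"network", "protocol"},
    8: {"database", "index"},
}
query = {"information", "retrieval"}
N = len(docs)
# n[k]: number of documents containing query term k.
n = {k: sum(1 for terms in docs.values() if k in terms) for k in query}

def term_weight(p: float, r: float) -> float:
    """Assumed BIR term weight from p = P(k|relevant) and r = P(k|non-relevant)."""
    return math.log(p / (1 - p)) + math.log((1 - r) / r)

def rank(weights: dict[str, float]) -> list[int]:
    """Score each doc by summing the weights of the query terms it contains."""
    scores = {d: sum(weights[k] for k in query if k in terms)
              for d, terms in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))

# Initial guess: p = 0.5 for every term, r estimated from the whole collection
# (smoothed with 0.5 so that the logarithms stay finite).
initial = {k: term_weight(0.5, (n[k] + 0.5) / (N + 1)) for k in query}
first_ranking = rank(initial)

# Feedback: the user judges docs 1 and 2 relevant (hypothetical judgments).
relevant = {1, 2}
V = len(relevant)
V_k = {k: sum(1 for d in relevant if k in docs[d]) for k in query}
refined = {
    k: term_weight((V_k[k] + 0.5) / (V + 1),
                   (n[k] - V_k[k] + 0.5) / (N - V + 1))
    for k in query
}
second_ranking = rank(refined)
print(first_ranking[:4], second_ranking[:4])
```

After feedback the term "retrieval" gains weight because both relevant documents contain it, so document 6 moves ahead of document 3. Repeating the judge-and-re-estimate step is the iteration described above.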

Probabilistic model - Ranking
•Given a user query $q$ and a document $d_j$, the probabilistic model tries to estimate the probability that the user will find the document $d_j$ interesting (i.e., relevant)
•The model assumes that this probability of relevance depends on the query and the document representations only
•The ideal answer set is referred to as $R$ and should maximize the probability of relevance. Documents in the set $R$ are predicted to be relevant.
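
The slide does not give the ranking formula. In the usual textbook formulation of this model (as in reference 2 of the book list), documents are ranked by the odds of relevance, with $\bar{R}$ denoting the set of non-relevant documents; it is reproduced here as background rather than as part of the lecture:

```latex
\[
\mathrm{sim}(d_j, q)
  = \frac{P(R \mid \vec{d_j})}{P(\bar{R} \mid \vec{d_j})}
  = \frac{P(\vec{d_j} \mid R)\, P(R)}{P(\vec{d_j} \mid \bar{R})\, P(\bar{R})}
  \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \bar{R})}
\]
```

The second step is Bayes' rule. The last step drops $P(R)$ and $P(\bar{R})$ because they are the same for every document in the collection and so do not affect the ordering.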

Probabilistic model
•Advantages
–Documents are ranked in decreasing order of their probability of relevance
•Disadvantages
–Need to guess the initial separation of documents into relevant and non-relevant sets
–All weights are binary
–The adoption of the independence assumption for index terms.