1
Language Model
CSC4170 Web Intelligence and Social Computing
Tutorial 8
Tutor: Tom Chao Zhou
Email: [email protected]
2
Outline
Language models
Finite automata and language models
Types of language models
Multinomial distributions over words
Query likelihood model
Application
Q&A
Reference
3
Language Models (LMs)
How can we come up with good queries?
Think of words that would likely appear in a relevant document.
Idea of LM:
A document is a good match to a query if the document model is
likely to generate the query.
4
Language Models (LMs)
Generative Model:
Recognize or generate strings.
The full set of strings that can be generated is called the language of the
automaton.
Language Model:
A function that puts a probability measure over strings drawn from some
vocabulary.
5
Language Models (LMs)
Example 1:
Calculate the probability of a word sequence.
Multiply the probabilities that the model gives to each word in the
sequence, together with the probability of continuing or stopping
after producing each word.
P(frog said that toad likes frog)=(0.01*0.03*0.04*0.01*0.02*0.01)
*(0.8*0.8*0.8*0.8*0.8*0.8*0.2)
=0.000000000001573
Most of the time, we will omit to include STOP and (1-STOP)
probabilities.
6
Language Models (LMs)
Example 2:
P(s|M
1)>P(s|M
2)
7
Language Models (LMs)
Basic LM using chain rule:
Unigram language model:
Throws away all conditioning context.
Most used in Information Retrieval.
Bigram language model:
Condition on the previous term.
8
Language Models (LMs)
Unigram LM:
Bag-of-words model.
Multinomial distributions over
words.
Mi
dtd
i
tfL
1
,
The length of document d. M is
the size of the vocabulary.
multinomial coefficient, can leave out in practical
calculations.
9
Query Likelihood Model
Query likelihood model:
Rank document by P(d|q)
Likelihood that document d is
relevant to the query.
Using Bayes rule:
P(q) is the same for all
documents.
P(d) is treated as uniform
across all d. )|()|( dqPqdP
10
Query Likelihood Model
Multinomial + Unigram:
Retrieve based on a language model:
Infer a LM for each document.
Estimate P(q|M
di).
Rank the documents according to these probabilities.
Multinomial coefficient for the query q.
Can be ignored.
11
Query Likelihood Model
Estimating the query generation probability:
Maximum Likelihood Estimation (MLE) + unigram LM
Limitations:
If we estimate P(t|Md)=0, documents will only give a query nonzero
probability if all of the query terms appear in the document.
Occurring words are poorly estimated, the probability of words
occurring once in the document is overestimated, because their
one occurrence was partly by chance.
12
Query Likelihood Model
Estimating the query generation probability:
Maximum Likelihood Estimation (MLE) + unigram LM
Smoothing:
Use the whole collection to smooth.
Linear Interpolation (Jelinek-Mercer Smoothing)
Bayesian Smoothing
13
Query Likelihood Model
Query likelihood model with linear interpolation:
Query likelihood model with Bayesian smoothing:
qt d
Cd
L
MtPMtP
dPqdP )
)|()|(
()()|(
14
Query Likelihood Model
Example using unigram + MLE + linear interpolation:
d1: Xyzzy reports a profit but revenue is down
d2: Quorus narrows quarter loss but revenue decreases further
λ=1/2
query: revenue down
ranking: d1>d2
15
Application
Community-based Question Answering (CQA) System:
Question Search.
Given a queried question, find a semantically equivalent question
for the queried question.
General Search Engine
Given a query, rank documents.