of Walden. But we also (in a bit of terminological ambiguity) use the word ‘n-gram’ to mean a probabilistic model that can estimate the probability of a word given the n-1 previous words, and thereby also to assign probabilities to entire sequences.
In later chapters we will introduce the much more powerful neural large language models, based on the transformer architecture of Chapter 10. But because n-grams have a remarkably simple and clear formalization, we use them to introduce some major concepts of large language modeling, including training and test sets, perplexity, sampling, and interpolation.
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given some history h. Suppose the history h is “The water of Walden Pond is so beautifully” and we want to know the probability that the next word is blue:

P(blue | The water of Walden Pond is so beautifully)    (3.1)
One way to estimate this probability is directly from relative frequency counts: take a very large corpus, count the number of times we see The water of Walden Pond is so beautifully, and count the number of times this is followed by blue. This would be answering the question “Out of the times we saw the history h, how many times was it followed by the word w”, as follows:

\[
P(\textit{blue} \mid \textit{The water of Walden Pond is so beautifully}) =
\frac{C(\textit{The water of Walden Pond is so beautifully blue})}
     {C(\textit{The water of Walden Pond is so beautifully})}
\tag{3.2}
\]
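To make the count-and-divide idea concrete, here is a minimal sketch in Python of how one might compute the relative frequency estimate of Eq. 3.2 from a tokenized corpus. The function name relative_frequency and the list-of-tokens corpus representation are illustrative assumptions, not a fixed API from the text.

    def relative_frequency(corpus_tokens, history, word):
        # Estimate P(word | history) by counting and dividing (Eq. 3.2).
        # corpus_tokens: list of tokens from a (hypothetically) very large corpus.
        # history: list of tokens, e.g. ["The", "water", ..., "beautifully"].
        h = len(history)
        history_count = 0          # C(history)
        continuation_count = 0     # C(history followed by word)
        for i in range(len(corpus_tokens) - h + 1):
            if corpus_tokens[i:i + h] == list(history):
                history_count += 1
                if i + h < len(corpus_tokens) and corpus_tokens[i + h] == word:
                    continuation_count += 1
        if history_count == 0:
            return None            # history never seen: no estimate possible
        return continuation_count / history_count

For instance, on the toy corpus "the water is so blue and the water is so cold".split(), calling relative_frequency(tokens, ["is", "so"], "blue") returns 0.5, since the history is so occurs twice and is followed by blue once.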
If we had a large enough corpus, we could compute these two counts and estimate the probability from Eq. 3.2. But even the entire web isn’t big enough to give us good estimates for counts of entire sentences. This is because language is creative; new sentences are invented all the time, and we can’t expect to get accurate counts for such large objects as entire sentences. For this reason, we’ll need more clever ways to estimate the probability of a word w given a history h, or the probability of an entire word sequence W.
Let’s start with some notation. First, throughout this chapter we’ll continue to refer to words, although in practice we usually compute language models over tokens like the BPE tokens of page ??. To represent the probability of a particular random variable X_i taking on the value “the”, or P(X_i = “the”), we will use the simplification P(the). We’ll represent a sequence of n words either as w_1 ... w_n or w_{1:n}. Thus the expression w_{1:n-1} means the string w_1, w_2, ..., w_{n-1}, but we’ll also be using the equivalent notation w_{<n}, which can be read as “all the elements of w from w_1 up to and including w_{n-1}”. For the joint probability of each word in a sequence having a particular value P(X_1 = w_1, X_2 = w_2, X_3 = w_3, ..., X_n = w_n) we’ll use P(w_1, w_2, ..., w_n).
Now, how can we compute probabilities of entire sequences like P(w_1, w_2, ..., w_n)? One thing we can do is decompose this probability using the chain rule of probability:

\[
P(X_1 \ldots X_n) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_{1:2}) \cdots P(X_n \mid X_{1:n-1})
= \prod_{k=1}^{n} P(X_k \mid X_{1:k-1})
\tag{3.3}
\]
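As a small illustration of Eq. 3.3, the following sketch multiplies out the chain-rule decomposition for a word sequence. The cond_prob argument stands in for any estimator of P(w_k | w_{1:k-1}); it is purely a placeholder assumption here, not a method defined in this chapter.

    def sequence_probability(words, cond_prob):
        # Chain rule (Eq. 3.3): P(w_1 ... w_n) = product over k of P(w_k | w_{1:k-1}).
        # cond_prob(word, history) is any conditional-probability estimator;
        # for the first word the history is the empty list.
        prob = 1.0
        for k, word in enumerate(words):
            history = words[:k]      # w_1 ... w_{k-1}
            prob *= cond_prob(word, history)
        return prob

The decomposition itself is exact; the harder question, which the rest of the chapter takes up, is how to estimate each conditional term when the history can be arbitrarily long.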