UNIT 1- C-1.pptx natural language processing


About This Presentation

Notes - natural language processing


Slide Content

NLP Basics of word level analysis

WORD LEVEL ANALYSIS: CORPUS, UNSMOOTHED N-GRAMS, EVALUATING N-GRAMS, SMOOTHING, INTERPOLATION, BACKOFF, WORD CLASSES

CORPUS A corpus in NLP is a large, organized collection of text or speech data that computers can read and analyze to help with language tasks. A corpus is like a big folder filled with real sentences, documents, or even audio that people have written or spoken. Example: Imagine a corpus made from three short sentences: “I am happy.” “NLP is fun.” “How are you?” Computers use these sentences in NLP tasks, such as translation or chatbots, to learn how words are used together.

What are N-Grams? N-Grams are contiguous sequences of n items (words/characters) from a text or speech corpus. Types: Unigram: Single word (e.g., "I") Bigram: Pair of consecutive words (e.g., "I am") Trigram: Three consecutive words (e.g., "I am Sam")
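As a rough sketch of these definitions (the helper name ngrams and the sample sentence are illustrative, not from the slides), a few lines of Python can enumerate the n-grams of a tokenized sentence:

```python
# Illustrative sketch: extract contiguous n-grams from a token list.
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i am sam".split()
print(ngrams(tokens, 1))  # unigrams: [('i',), ('am',), ('sam',)]
print(ngrams(tokens, 2))  # bigrams:  [('i', 'am'), ('am', 'sam')]
print(ngrams(tokens, 3))  # trigrams: [('i', 'am', 'sam')]
```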

What is an Unsmoothed N-Gram? An unsmoothed n-gram model counts only the word groups that actually appear in the data (the "corpus"). If a word group never appears in the corpus, its probability becomes zero.

What is Unsmoothed N-Gram?.. Unsmoothed N-Gram in Natural Language Processing (NLP) means calculating the probability of sequences of words (called N-grams) directly from their observed frequencies in a text corpus without applying any smoothing techniques. In this method, if a particular word sequence never appeared in the training data, its probability is zero, which is a key limitation.

What is Unsmoothed N-Gram?.. In unsmoothed N-grams, the probability of a sequence is calculated as the ratio of how many times that sequence appears to the total number of such sequences in the corpus. If a sequence is not seen in the training data, its probability is zero (which can cause problems when predicting new sequences).

What is Unsmoothed N-Gram?.. Example 1: Suppose the corpus is: "I love natural language processing" "Natural language processing is fun"

Some common tasks Tokenization – splitting text into words or phrases. Lowercasing – converting all words to lowercase. Stemming or lemmatization – reducing words to their base form. Bag of Words (BoW) – counting occurrences of each word. N-grams – grouping sequences of words.

Tokenize and lowercase "I love natural language processing" → ["i", "love", "natural", "language", "processing"] "Natural language processing is fun" → ["natural", "language", "processing", "is", "fun"]

Build vocabulary(V) Unique words: ["i", "love", "natural", "language", "processing", "is", "fun"]

Count occurrences
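A small Python sketch of the preprocessing steps above, applied to the two example sentences (variable names are ours; the counts follow directly from the corpus):

```python
# Tokenize and lowercase the two example sentences, build the vocabulary,
# and count how often each word occurs.
from collections import Counter

corpus = [
    "I love natural language processing",
    "Natural language processing is fun",
]

tokenized = [sentence.lower().split() for sentence in corpus]
vocabulary = sorted({word for sent in tokenized for word in sent})
unigram_counts = Counter(word for sent in tokenized for word in sent)

print(vocabulary)      # 7 unique words
print(unigram_counts)  # 'natural', 'language', 'processing' occur twice; the rest once
```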

Example 1: Calculate Bigram probabilities (UNSMOOTHED) For the same example as above: Sentence 1: "I love natural language processing" Sentence 2: "Natural language processing is fun"

Solution Sentence 1 → ['i', 'love', 'natural', 'language', 'processing'] Sentence 2 → ['natural', 'language', 'processing', 'is', 'fun']

Solution…. Now list all bigrams:

Solution… Count occurrences:

Solution… Unigram counts (for denominator in probabilities):

Solution… Calculate bigram probabilities:

Final Solution Finally, Now compute them one by one:
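As a sketch of the full unsmoothed bigram computation outlined above (variable names are ours; all counts come from the two example sentences):

```python
# Unsmoothed bigram model: P(w2 | w1) = count(w1, w2) / count(w1),
# using raw counts from the two example sentences only.
from collections import Counter

corpus = [
    "I love natural language processing",
    "Natural language processing is fun",
]
tokenized = [s.lower().split() for s in corpus]

unigram_counts = Counter(w for s in tokenized for w in s)
bigram_counts = Counter(
    (s[i], s[i + 1]) for s in tokenized for i in range(len(s) - 1)
)

for (w1, w2), c in sorted(bigram_counts.items()):
    print(f"P({w2} | {w1}) = {c}/{unigram_counts[w1]} = {c / unigram_counts[w1]:.2f}")
# e.g. P(language | natural) = 2/2 = 1.00, P(is | processing) = 1/2 = 0.50
```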

Note 1) Unsmoothed means we are directly using the observed counts without making any adjustments. If a word pair never appeared in the data, its probability would be zero. 2) Smoothing techniques, like Laplace smoothing, are used to avoid zero probabilities by adding small counts artificially.

Why do we need smoothing? Data sparsity – In real-world text, many valid word sequences don’t appear in the training set. Avoiding zero probabilities – Without smoothing, unseen sequences would be impossible, even if they are likely in natural language. Generalization – It helps the model handle new text better by assuming some uncertainty.

1. Add-One (Laplace) Smoothing Where V denotes the total number of unique words (the vocabulary size).
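For reference, the standard add-one (Laplace) estimate for a bigram, using V as defined above and C(·) for a raw corpus count, is:

```latex
\[
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}
\]
```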

Example 1: Based on Laplace Smoothing Suppose we have the bigram "processing fun". Bigram: ('processing', 'fun'). Clearly it never appeared, so without smoothing its probability would be 0.

Solution (Based on Laplace Smoothing, add 1) With Laplace smoothing:
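Using the counts from the example corpus (C(processing) = 2, C(processing, fun) = 0, V = 7), the Laplace-smoothed value works out to:

```latex
\[
P_{\text{Laplace}}(\text{fun} \mid \text{processing}) = \frac{0 + 1}{2 + 7} = \frac{1}{9} \approx 0.11
\]
```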

Add-k Smoothing (Generalization of Laplace) Instead of adding 1, we add a small constant k; typically k = 0.5 or k = 0.01.
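Written out, the add-k estimate for a bigram (same notation as above) is:

```latex
\[
P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + kV}
\]
```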

Example 2: Based on Add-k Smoothing (Generalization of Laplace) This is useful when adding too much weight (like 1) to unseen events distorts the model.
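For example, with k = 0.5 and the same unseen bigram ('processing', 'fun') from the example corpus (C(processing) = 2, V = 7):

```latex
\[
P_{\text{Add-}0.5}(\text{fun} \mid \text{processing}) = \frac{0 + 0.5}{2 + 0.5 \times 7} = \frac{0.5}{5.5} \approx 0.09
\]
```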

Backoff Smoothing If an N-gram's count is zero, it backs off to the (N-1)-gram, and if that's zero, it backs off to the (N-2)-gram, and so on. Example: If a trigram's count is zero, it backs off to the bigram, and if that's zero, it backs off to the unigram.
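A minimal Python sketch of this backoff chain (function and argument names are ours; real backoff schemes such as Katz backoff also discount and renormalize the probabilities, which is omitted here):

```python
# Simple backoff: try the trigram estimate, fall back to the bigram,
# then to the unigram relative frequency. No discounting is applied.
def backoff_probability(w1, w2, w3, trigram_counts, bigram_counts,
                        unigram_counts, total_tokens):
    if trigram_counts.get((w1, w2, w3), 0) > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:
        return bigram_counts[(w2, w3)] / unigram_counts[w2]
    return unigram_counts.get(w3, 0) / total_tokens
```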

Example 1:…. Now, if we check "processing love", which never appears, its probability is 0.

Important Key Points Unsmoothed N-grams are simple to calculate and understand but face zero probability for unseen sequences. This affects the model's ability to generalize to new text. Smoothing techniques are used to fix this zero-probability problem.

Smoothing in NLP A technique to fix the zero-probability problem caused by unsmoothed N-gram models. When an N-gram sequence does not appear in the training data, its unsmoothed probability is zero, which is problematic for predicting or generating new text because it suggests that the sequence is impossible. Smoothing assigns a small, non-zero probability to these unseen sequences so the model can better generalize.

Explanation Using the Previous Example: Example 1: The bigram example with corpus sentences: "I love natural language processing" "Natural language processing is fun" Here, unsmoothed bigram probabilities gave zero probability to an unseen bigram like "processing love".

Note: Vocabulary Size V Suppose the corpus consists of two sentences: "I love natural language processing". "Natural language processing is fun". First, list all the distinct words (unique tokens) appearing in these sentences: I, love, natural, language, processing, is, fun. Count the unique words: There are 7 unique words. If we consider case differences, the vocabulary size might increase. For example, if "Natural" and "natural" are treated distinctly (case-sensitive), that makes 8 unique tokens.

Backoff smoothing Backoff smoothing in NLP is a method used to handle zero probabilities for unseen n-grams by "backing off" to lower-order n-grams when the higher-order n-gram count is zero. It means if you don't have data for a specific 3-word sequence (trigram), use the 2-word sequence (bigram) instead; if that's missing, use the 1-word probability (unigram).

Explanation of Backoff with an Example Example: 1 Suppose the corpus is: "I love natural language processing" "Natural language processing is fun" Now, we want to predict the probability of a trigram (three-word sequence) like i) ("language", "processing", "is") and ii) ("love", "natural", "fun")

Solution Step 1 – Tokenize and list bigrams & trigrams in sentence 1 and sentence 2

Solution.. Bigram and trigram counts:

Solution… Trigrams:

Solution (i) Now, for i) ("language", "processing", "is"), calculate P(is | language, processing)
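With the counts from the example corpus (the trigram "language processing is" occurs once, the bigram "language processing" occurs twice), this works out to:

```latex
\[
P(\text{is} \mid \text{language}, \text{processing}) = \frac{C(\text{language processing is})}{C(\text{language processing})} = \frac{1}{2} = 0.5
\]
```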

Solution (ii) Backoff required Backoff is required here, since the trigram count is zero. Since this trigram never occurred, we back off to the bigram P(fun | natural)

Solution (ii) Check unigram count:

Solution (ii) Final result using backoff smoothing
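From the example corpus, the trigram "love natural fun" and the bigram "natural fun" both have count 0, so the model backs off to the unigram "fun", which occurs once among the 10 tokens of the corpus:

```latex
\[
P(\text{fun}) = \frac{C(\text{fun})}{N} = \frac{1}{10} = 0.1
\]
```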

Example: 1… Count of trigram ("natural language processing") = 2 Count of bigram ("language processing") = 3 Count of unigram ("processing") = 4 Now consider the trigram ("processing is fun"): Count of trigram ("processing is fun") = 0 (unseen)

Example: 1… Since the trigram count is zero, backoff means: instead of using the trigram, check the bigram ("is fun") count. If the bigram is available, use its probability. If the bigram is also zero, then use the unigram ("fun") probability. This way, backoff avoids zero-probability issues by falling back to shorter histories.

Note This is not the standard way to compute conditional probabilities like bigrams or trigrams. It is sometimes used in very basic frequency models, but it does not correctly represent the true probability distribution that models language. This formula is used when we want the joint probability of a bigram, rather than the conditional probability.
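To make the contrast explicit, the two quantities can be written as (standard definitions, same notation as before):

```latex
\[
P_{\text{joint}}(w_1, w_2) = \frac{C(w_1, w_2)}{\text{total number of bigrams}}, \qquad
P_{\text{cond}}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}
\]
```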

Note… Example: Suppose the total number of bigrams is 10, and the bigram ('language', 'processing') appears 2 times. Then: P('language', 'processing') = 2/10 = 0.2. This tells us that this pair represents 20% of all bigrams in the dataset.

Note

Conclusion Use the joint probability formula when you're analyzing how frequent a bigram is relative to all bigrams. Use the conditional probability formula when you're modeling language and want to predict the next word based on the previous one. Hence, in NLP language modeling we use the conditional approach.