UNIT 1- C-1.pptx natural language processing


About This Presentation

Notes - natural language processing


Slide Content

NLP Basics of word level analysis

WORD LEVEL ANALYSIS: CORPUS, UNSMOOTHED N-GRAMS, EVALUATING N-GRAMS, SMOOTHING, INTERPOLATION, BACKOFF, WORD CLASSES

CORPUS A corpus in NLP is a large, organized collection of text or speech data that computers can read and analyze to help with language tasks. A corpus is like a big folder filled with real sentences, documents, or even audio that people have written or spoken. Example: Imagine a corpus made from three short sentences: “I am happy.” “NLP is fun.” “How are you?” Computers use these sentences in NLP tasks, such as translation or chatbots, to learn how words are used together.

What are N-Grams? N-Grams are contiguous sequences of n items (words/characters) from a text or speech corpus. Types: Unigram: Single word (e.g., "I") Bigram: Pair of consecutive words (e.g., "I am") Trigram: Three consecutive words (e.g., "I am Sam")
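As a rough sketch of these definitions (the helper name ngrams and the sample sentence are illustrative, not from the slides), a few lines of Python can enumerate the n-grams of a tokenized sentence:

```python
# Illustrative sketch: extract contiguous n-grams from a token list.
def ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i am sam".split()
print(ngrams(tokens, 1))  # unigrams: [('i',), ('am',), ('sam',)]
print(ngrams(tokens, 2))  # bigrams:  [('i', 'am'), ('am', 'sam')]
print(ngrams(tokens, 3))  # trigrams: [('i', 'am', 'sam')]
```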

What is an Unsmoothed N-Gram? An unsmoothed n-gram model counts only the word groups that actually appear in the data (the "corpus"). If a word group never appears in the corpus, its probability becomes zero.

What is Unsmoothed N-Gram?.. Unsmoothed N-Gram in Natural Language Processing (NLP) means calculating the probability of sequences of words (called N-grams) directly from their observed frequencies in a text corpus without applying any smoothing techniques. In this method, if a particular word sequence never appeared in the training data, its probability is zero, which is a key limitation.

What is Unsmoothed N-Gram?.. In unsmoothed N-grams, the probability of a sequence is calculated as the ratio of how many times that sequence appears to the total number of such sequences in the corpus. If a sequence is not seen in the training data, its probability is zero (which can cause problems when predicting new sequences).

What is Unsmoothed N-Gram?.. Example 1: Suppose the corpus is: "I love natural language processing" "Natural language processing is fun"

Some common tasks Tokenization – splitting text into words or phrases. Lowercasing – converting all words to lowercase. Stemming or lemmatization – reducing words to their base form. Bag of Words (BoW) – counting occurrences of each word. N-grams – grouping sequences of words.

Tokenize and lowercase "I love natural language processing" → ["i", "love", "natural", "language", "processing"] "Natural language processing is fun" → ["natural", "language", "processing", "is", "fun"]

Build vocabulary(V) Unique words: ["i", "love", "natural", "language", "processing", "is", "fun"]

Count occurrences
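A small Python sketch of the preprocessing steps above, applied to the two example sentences (variable names are ours; the counts follow directly from the corpus):

```python
# Tokenize and lowercase the two example sentences, build the vocabulary,
# and count how often each word occurs.
from collections import Counter

corpus = [
    "I love natural language processing",
    "Natural language processing is fun",
]

tokenized = [sentence.lower().split() for sentence in corpus]
vocabulary = sorted({word for sent in tokenized for word in sent})
unigram_counts = Counter(word for sent in tokenized for word in sent)

print(vocabulary)      # 7 unique words
print(unigram_counts)  # 'natural', 'language', 'processing' occur twice; the rest once
```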

Example 1: Calculate Bigram probabilities (UNSMOOTHED) For the same example as above: Sentence 1: "I love natural language processing" Sentence 2: "Natural language processing is fun"

Solution Sentence 1 → ['i', 'love', 'natural', 'language', 'processing'] Sentence 2 → ['natural', 'language', 'processing', 'is', 'fun']

Solution…. Now list all bigrams:

Solution… Count occurrences:

Solution… Unigram counts (for denominator in probabilities):

Solution… Calculate bigram probabilities:

Final Solution Finally, Now compute them one by one:
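As a sketch of the full unsmoothed bigram computation outlined above (variable names are ours; all counts come from the two example sentences):

```python
# Unsmoothed bigram model: P(w2 | w1) = count(w1, w2) / count(w1),
# using raw counts from the two example sentences only.
from collections import Counter

corpus = [
    "I love natural language processing",
    "Natural language processing is fun",
]
tokenized = [s.lower().split() for s in corpus]

unigram_counts = Counter(w for s in tokenized for w in s)
bigram_counts = Counter(
    (s[i], s[i + 1]) for s in tokenized for i in range(len(s) - 1)
)

for (w1, w2), c in sorted(bigram_counts.items()):
    print(f"P({w2} | {w1}) = {c}/{unigram_counts[w1]} = {c / unigram_counts[w1]:.2f}")
# e.g. P(language | natural) = 2/2 = 1.00, P(is | processing) = 1/2 = 0.50
```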

Note 1) Unsmoothed means we are directly using the observed counts without making any adjustments. If a word pair never appeared in the data, its probability would be zero. 2) Smoothing techniques, like Laplace smoothing, are used to avoid zero probabilities by adding small counts artificially.

Why do we need smoothing? Data sparsity – In real-world text, many valid word sequences don’t appear in the training set. Avoiding zero probabilities – Without smoothing, unseen sequences would be impossible, even if they are likely in natural language. Generalization – It helps the model handle new text better by assuming some uncertainty.

1. Add-One (Laplace) Smoothing Where V denotes the total number of unique words (the vocabulary size).
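For reference, the standard add-one (Laplace) estimate for a bigram, using V as defined above and C(·) for a raw corpus count, is:

```latex
\[
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + 1}{C(w_{i-1}) + V}
\]
```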

Example 1: Based on Laplace Smoothing Suppose we have the bigram "processing fun". Bigram: ('processing', 'fun'). Clearly it never appeared, so without smoothing its probability would be 0.

Solution (Based on Laplace Smoothing, add 1) With Laplace smoothing:
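Using the counts from the example corpus (C(processing) = 2, C(processing, fun) = 0, V = 7), the Laplace-smoothed value works out to:

```latex
\[
P_{\text{Laplace}}(\text{fun} \mid \text{processing}) = \frac{0 + 1}{2 + 7} = \frac{1}{9} \approx 0.11
\]
```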

Add-k Smoothing (Generalization of Laplace) Instead of adding 1, we add a small constant k; typically k = 0.5 or k = 0.01.
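Written out, the add-k estimate for a bigram (same notation as above) is:

```latex
\[
P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + kV}
\]
```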

Example 2: Based on Add-k Smoothing (Generalization of Laplace) This is useful when adding too much weight (like 1) to unseen events distorts the model.
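For example, with k = 0.5 and the same unseen bigram ('processing', 'fun') from the example corpus (C(processing) = 2, V = 7):

```latex
\[
P_{\text{Add-}0.5}(\text{fun} \mid \text{processing}) = \frac{0 + 0.5}{2 + 0.5 \times 7} = \frac{0.5}{5.5} \approx 0.09
\]
```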

Backoff Smoothing If an N-gram's count is zero, it backs off to the (N-1)-gram, and if that's zero, it backs off to the (N-2)-gram, and so on. Example: If a trigram's count is zero, it backs off to the bigram, and if that's zero, it backs off to the unigram.
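A minimal Python sketch of this backoff chain (function and argument names are ours; real backoff schemes such as Katz backoff also discount and renormalize the probabilities, which is omitted here):

```python
# Simple backoff: try the trigram estimate, fall back to the bigram,
# then to the unigram relative frequency. No discounting is applied.
def backoff_probability(w1, w2, w3, trigram_counts, bigram_counts,
                        unigram_counts, total_tokens):
    if trigram_counts.get((w1, w2, w3), 0) > 0:
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]
    if bigram_counts.get((w2, w3), 0) > 0:
        return bigram_counts[(w2, w3)] / unigram_counts[w2]
    return unigram_counts.get(w3, 0) / total_tokens
```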

Example 1:…. Now, if we check "processing love", which never appears, its probability is 0.

Important Key Points Unsmoothed N-grams are simple to calculate and understand but face zero probability for unseen sequences. This affects the model's ability to generalize to new text. Smoothing techniques are used to fix this zero-probability problem.

Smoothing in NLP A technique to fix the zero-probability problem caused by unsmoothed N-gram models. When an N-gram sequence does not appear in the training data, its unsmoothed probability is zero, which is problematic for predicting or generating new text because it suggests that the sequence is impossible. Smoothing assigns a small, non-zero probability to these unseen sequences so the model can better generalize.

Explanation Using the Previous Example: Example 1: The bigram example with corpus sentences: "I love natural language processing" "Natural language processing is fun" Here, unsmoothed bigram probabilities gave zero probability to an unseen bigram like "processing love".

Note: Vocabulary Size V Suppose the corpus consists of two sentences: "I love natural language processing". "Natural language processing is fun". First, list all the distinct words (unique tokens) appearing in these sentences: I, love, natural, language, processing, is, fun. Count the unique words: There are 7 unique words. If we consider case differences, the vocabulary size might increase. For example, if "Natural" and "natural" are treated distinctly (case-sensitive), that makes 8 unique tokens.

Backoff smoothing Backoff smoothing in NLP is a method used to handle zero probabilities for unseen n-grams by "backing off" to lower-order n-grams when the higher-order n-gram count is zero. It means if you don't have data for a specific 3-word sequence (trigram), use the 2-word sequence (bigram) instead; if that's missing, use the 1-word probability (unigram).

Explanation of Backoff with an Example Example: 1 Suppose the corpus is: "I love natural language processing" "Natural language processing is fun" Now, we want to predict the probability of a trigram (three-word sequence) like i) ("language", "processing", "is") and ii) ("love", "natural", "fun")

Solution Step 1 – Tokenize and list bigrams & trigrams in sentence 1 and sentence 2

Solution.. Bigram and trigram counts:

Solution… Trigrams:

Solution (i) Now, for i) ("language", "processing", "is"), calculate P(is | language, processing)
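With the counts from the example corpus (the trigram "language processing is" occurs once, the bigram "language processing" occurs twice), this works out to:

```latex
\[
P(\text{is} \mid \text{language}, \text{processing}) = \frac{C(\text{language processing is})}{C(\text{language processing})} = \frac{1}{2} = 0.5
\]
```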

Solution (ii) Backoff required Backoff is required here, since the trigram count is zero. Since this trigram never occurred, we back off to the bigram P(fun | natural)

Solution (ii) Check unigram count:

Solution (ii) Final result using backoff smoothing
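From the example corpus, the trigram "love natural fun" and the bigram "natural fun" both have count 0, so the model backs off to the unigram "fun", which occurs once among the 10 tokens of the corpus:

```latex
\[
P(\text{fun}) = \frac{C(\text{fun})}{N} = \frac{1}{10} = 0.1
\]
```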

Example: 1… Count of trigram ("natural language processing") = 2 Count of bigram ("language processing") = 3 Count of unigram ("processing") = 4 Now consider the trigram ("processing is fun"): Count of trigram ("processing is fun") = 0 (unseen)

Example: 1… Since the trigram count is zero, backoff means: instead of using the trigram, check the bigram ("is fun") count. If the bigram is available, use its probability. If the bigram is also zero, then use the unigram ("fun") probability. This way, backoff avoids zero-probability issues by falling back to shorter histories.

Note This is not the standard way to compute conditional probabilities like bigrams or trigrams. It is sometimes used in very basic frequency models, but it does not correctly represent the true probability distribution that models language. This formula is used when we want the joint probability of a bigram, rather than the conditional probability.
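To make the contrast explicit, the two quantities can be written as (standard definitions, same notation as before):

```latex
\[
P_{\text{joint}}(w_1, w_2) = \frac{C(w_1, w_2)}{\text{total number of bigrams}}, \qquad
P_{\text{cond}}(w_2 \mid w_1) = \frac{C(w_1, w_2)}{C(w_1)}
\]
```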

Note… Example: Suppose the total number of bigrams is 10, and the bigram ('language', 'processing') appears 2 times. Then: P('language', 'processing') = 2/10 = 0.2. This tells us that this pair represents 20% of all bigrams in the dataset.

Note

Conclusion Use the joint probability formula when you're analyzing how frequent a bigram is relative to all bigrams. Use the conditional probability formula when you're modeling language and want to predict the next word based on the previous one. Hence, in NLP language modeling we use the conditional approach.