Natural language processing unit - 2 ppt

About This Presentation

nlp unit 2


Slide Content

UNIT-II -NLP

Syllabus: Introduction to word types: Word2Vec, Word Embedding, POS Tagging, Count Vectorizer, Multiword Expressions, the role of language models. Simple N-gram models. Bag of words, estimating parameters and smoothing. Evaluating language models.

Word Embedding Word embeddings are a way of representing words in a vector space (a list of numbers). They are numerical representations of words that capture their meanings, relationships, and contexts. Word embeddings allow words with similar meanings to have similar representations in the vector space. In simpler terms, instead of representing words as simple strings like "apple" or "dog," we represent them as vectors of numbers, such that similar words (like "cat" and "dog") will have similar vectors.

Why Do We Need Word Embeddings? Computers cannot understand human language directly, so we need to convert words into a form they can process. The traditional approach, called one-hot encoding , represents words as sparse vectors (like a long list of zeros and a single 1 at the index representing the word). However, this doesn't capture any relationships between words. Word embeddings solve this problem by placing semantically similar words closer together in the vector space.

Explanation of Word Embeddings: Step 1: Traditional Representation vs. Word Embeddings. Traditional Representation (One-Hot Encoding): Example: Let's take three words: "cat", "dog", and "fish". In one-hot encoding, each word is represented as a vector with as many dimensions as the size of the vocabulary, with a 1 at the index of the word and 0s elsewhere. For a vocabulary of size 3, it might look like this: cat → [1, 0, 0], dog → [0, 1, 0], fish → [0, 0, 1]. This representation has a couple of drawbacks: High dimensionality: for large vocabularies, these vectors become very long. No relationship captured: the vector [1, 0, 0] for "cat" is no more related to [0, 1, 0] for "dog" than it is to [0, 0, 1] for "fish". There is no concept of meaning or similarity between words.

Word Embedding Representation: Word embeddings solve this by mapping each word into a continuous vector space, where the distance between vectors represents semantic similarity. For example, the vector for "cat" would be close to the vector for "dog", but far away from "fish", because "cat" and "dog" are more semantically related. Example embeddings: cat → [0.2, 0.4, 0.6], dog → [0.1, 0.5, 0.7], fish → [0.9, 0.3, 0.2]. In this example, "cat" and "dog" have similar vectors, while "fish" has a different vector because it is less related to "cat" and "dog".
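
To make "similar vectors" concrete, here is a minimal sketch (not from the slides) that measures the similarity between the toy embedding vectors above using cosine similarity with NumPy; the vectors are the illustrative values from this slide, not learned embeddings.

```python
import numpy as np

# Toy embeddings copied from the slide (illustrative values, not learned vectors)
embeddings = {
    "cat":  np.array([0.2, 0.4, 0.6]),
    "dog":  np.array([0.1, 0.5, 0.7]),
    "fish": np.array([0.9, 0.3, 0.2]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1.0 means very similar
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))   # ~0.99 (similar)
print(cosine_similarity(embeddings["cat"], embeddings["fish"]))  # ~0.58 (less similar)
```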

Cont.. Step 2: How Are Word Embeddings Learned? Word embeddings are typically learned using machine learning models trained on large text datasets. The goal is to find a representation of each word in such a way that similar words are close together in the vector space. Popular Algorithms for Learning Word Embeddings: Word2Vec (Continuous Bag of Words - CBOW, and Skip-Gram): CBOW predicts a target word based on context words around it. Skip-Gram does the reverse: it predicts the context words based on a target word. Both models use a neural network and learn to adjust the word vectors (embeddings) to make accurate predictions.

Word2Vec Word2Vec is a popular technique in Natural Language Processing (NLP) used to represent words as vectors (numbers). These vectors are created in such a way that words with similar meanings are close together in the vector space, making it easier for computers to understand relationships between words. It was developed by Google in 2013 and is widely used to understand the relationships between words in a text.

Real-World Example: A Restaurant Menu. Imagine you're reading a menu at a restaurant. Here's a simplified list of items: Burger, Fries, Coke; Pizza, Garlic Bread, Pepsi; Sushi, Miso Soup, Green Tea. From these combinations, Word2Vec would learn that: "Burger" is similar to "Pizza" because both are paired with other fast-food items. "Coke" is similar to "Pepsi" because they are both drinks that appear with similar foods. "Sushi" is different from "Burger" because it appears in a different context, with Japanese food items like "Miso Soup."
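
As a rough illustration only, the sketch below trains Word2Vec on the three menu lines using the gensim library (gensim 4.x API assumed); the corpus is far too small for meaningful vectors, so the point is just the shape of the code, with min_count=1 so that every item is kept.

```python
from gensim.models import Word2Vec

# Each menu line is treated as one tiny "sentence" (a list of tokens)
menu_corpus = [
    ["burger", "fries", "coke"],
    ["pizza", "garlic_bread", "pepsi"],
    ["sushi", "miso_soup", "green_tea"],
]

model = Word2Vec(
    sentences=menu_corpus,
    vector_size=10,   # dimensionality of the word vectors
    window=2,         # how many neighbouring items count as context
    min_count=1,      # keep items even if they appear only once
    sg=0,             # 0 = CBOW, 1 = Skip-Gram
    epochs=50,
)

print(model.wv["coke"])               # the learned 10-dimensional vector for "coke"
print(model.wv.most_similar("coke"))  # neighbours ranked by cosine similarity
```

With a realistic amount of text (millions of sentences), most_similar("coke") would rank "pepsi" near the top, because the two appear in similar contexts.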

Scenario: Cricket Player and Match Recommendation System Imagine you're using a cricket app that helps you discover cricket players, matches, or even teams you might like based on your preferences and past activity. How does the app know which players or matches to recommend next? The answer is word2vec , which helps the app understand relationships between players, teams, and match types based on their context.

Understanding word2vec in Cricket Let’s say the app treats each cricket player, match, and team as a "word" in a massive "sentence" representing all the cricket data. Just like how words in a sentence have relationships based on their meanings, players and matches also have relationships based on their context. For example: Player 1: Virat Kohli (Batsman, India) Player 2: Rohit Sharma (Batsman, India) Player 3: Joe Root (Batsman, England) Player 4: Ben Stokes (All-rounder, England) Player 5: Jofra Archer (Bowler, England) Player 6: Steve Smith (Batsman, Australia)

How word2vec Works in This Case Now, the app uses a word2vec-like model to convert these players into vectors (numbers). These vectors represent how similar players are to each other based on things like: Role (batsman, bowler, all-rounder) Country (India, England, Australia) Playing style (aggressive, defensive, all-rounder)

Cont.. The app analyzes these contexts and places similar players closer together in the vector space. Here's how it might look: Virat Kohli and Rohit Sharma will be represented by vectors that are close together because both are Indian batsmen who play in a similar aggressive style. Joe Root and Ben Stokes will have vectors that are relatively close because they both play for England , though Root is more of a traditional batsman , and Stokes is an all-rounder . Jofra Archer will be further from players like Kohli and Root because he is a bowler with a completely different role. Steve Smith will be close to Kohli and Sharma in the vector space since he is also a batsman , but his unique style may position him slightly apart.

Why Do We Need Word2Vec? Words, by themselves, are just text. Computers can't directly understand text, so we need to convert words into numbers. But a simple number (like assigning "apple" = 1 and "orange" = 2) doesn't capture the meaning of the words or how they're related. Word2Vec solves this problem by creating vectors (lists of numbers) that represent words in a way that captures their meanings and relationships.

How Does Word2Vec Work? Contextual Representation : Word2Vec is based on the idea that "words that appear in similar contexts have similar meanings." For example, in the sentences "I love apples" and "I love oranges," the words "apples" and "oranges" appear in similar contexts ("I love").

Cont.. Training with Context : Word2Vec uses a neural network to learn word relationships from large text data. It trains in two main ways: CBOW (Continuous Bag of Words) : Predicts a word based on its surrounding words. Skip-Gram : Predicts surrounding words based on a single word. Word Vectors : Each word is represented as a vector of numbers (e.g., [1.2, -0.8, 0.5, ...]). Words with similar meanings or used in similar ways have similar vectors.

CBOW (Continuous Bag of Words) : Continuous Bag of Words (CBOW) model is one of the two main architectures used in Word2Vec (the other being Skip-Gram ). It’s a simple but effective method for learning word embeddings. CBOW predicts a word based on its context (the surrounding words).

Example 1. Start with a Simple Sentence. Write a sentence on the board for the students to understand: Sentence: "The cat sat on the mat." 2. Explain the Goal of CBOW. What is CBOW? CBOW predicts a target word (like "sat") using its context words (like "The", "cat", "on", "the", "mat"). Write on the board: Context words → Predict Target Word

Cont.. 3. Vocabulary Setup Explain that CBOW uses a vocabulary , which is a list of all unique words in the text. Example Vocabulary: "The", "cat", "sat", "on", "mat" Assign an index to each word: The: 0, cat: 1, sat: 2, on: 3, mat: 4 4. Represent Words as Vectors (One-Hot Encoding) Explain that each word is represented as a one-hot vector : A one-hot vector is a list of 0s with a single 1 at the index of the word. The → [1, 0, 0, 0, 0] Cat → [0, 1, 0, 0, 0] Sat → [0, 0, 1, 0, 0]
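
A small sketch (plain Python, not from the slides) that reproduces these one-hot vectors from the toy vocabulary:

```python
# Toy vocabulary from the slide, with the same indices: The: 0, cat: 1, sat: 2, on: 3, mat: 4
vocab = ["The", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A list of 0s with a single 1 at the word's index
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("The"))  # [1, 0, 0, 0, 0]
print(one_hot("cat"))  # [0, 1, 0, 0, 0]
print(one_hot("sat"))  # [0, 0, 1, 0, 0]
```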

Cont.. 5. Training Example Write an example of how CBOW uses context words to predict a target word: Context Words: ["The", "cat", "on", "the", "mat"] Target Word: "sat"

6. Step-by-Step CBOW Process Step 1: Initialize word vectors Each word is represented by a random vector at the beginning of the training. Suppose we have the following random vector for each word (simplified for clarity):

Step 2: Calculate the context vector. Now, we average the vectors of the context words ("The", "cat", "on", "the", "mat") to get a single context vector:

Cont.. Step 3: Calculate the prediction Now, we use the context vector to predict the target word. This is done by multiplying the context vector with a weight matrix and passing it through a softmax function to calculate the probability of each word being the target. The model will adjust its weights during training to make the predicted target word more likely.

Cont.. 3. Compute the Prediction Scores (Dot Product). Now, we take the context vector and multiply it by the weight matrix to get prediction scores. This is done by taking the dot product of the context vector with each word's vector in the vocabulary. For example, let's compute the dot product between the context vector and the vector of each word in the vocabulary: Context vector: [0.25, 0.275, 0.3]

"The": 0.17 "cat": 0.1975 "sits": 0.3275 "on": 0.2175 "mat": 0.2125 These values are called scores , and they indicate how well each word in the vocabulary fits the context.

Step 1: Exponentiate each score: e^0.17 ≈ 1.1852, e^0.1975 ≈ 1.2184, e^0.3275 ≈ 1.3875, e^0.2175 ≈ 1.2430, e^0.2125 ≈ 1.2370. Step 2: Compute the Sum of Exponentials. Add up all the exponential values: Sum = 1.1852 + 1.2184 + 1.3875 + 1.2430 + 1.2370 = 6.2711

Cont.. Step 3: Divide each exponential by the sum to get the softmax probabilities: [0.1890, 0.1943, 0.2213, 0.1982, 0.1972]. Step 4: Verify the Results. Finally, check that the probabilities sum to 1.0: 0.1890 + 0.1943 + 0.2213 + 0.1982 + 0.1972 = 1.0. The target word "sat" receives the highest probability (0.2213).
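
The scores-to-probabilities step can be checked with a few lines of NumPy; this is only a verification sketch using the scores from the walkthrough above, not part of a full CBOW implementation.

```python
import numpy as np

vocab  = ["The", "cat", "sat", "on", "mat"]
scores = np.array([0.17, 0.1975, 0.3275, 0.2175, 0.2125])  # dot products from the slide

# Softmax: exponentiate every score, then divide by the sum of the exponentials
exp_scores = np.exp(scores)             # ≈ [1.185, 1.218, 1.388, 1.243, 1.237]
probs = exp_scores / exp_scores.sum()   # sum of exponentials ≈ 6.271

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.4f}")   # The: 0.1890, cat: 0.1943, sat: 0.2213, on: 0.1982, mat: 0.1972
print(round(probs.sum(), 4))    # 1.0
```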

Skip-Gram: Skip-gram is a model used to predict context words based on a target word. Instead of predicting the target word from its surrounding context, the Skip-gram model works the other way around: it tries to predict the surrounding context words from a given target word.
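
To see what "predicting the context from the target" means in practice, here is a small illustrative sketch (plain Python, window size chosen arbitrarily) that generates the (target, context) training pairs a Skip-gram model learns from:

```python
def skipgram_pairs(tokens, window=2):
    # Generate (target, context) pairs used to train a Skip-gram model
    pairs = []
    for i, target in enumerate(tokens):
        # Look at up to `window` words on each side of the target word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
for target, context in skipgram_pairs(sentence, window=2):
    print(target, "->", context)
# e.g. the target "sat" produces: sat -> the, sat -> cat, sat -> on, sat -> the
```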

Applications Search Engines : Understanding synonyms and related words. Recommendation Systems : Finding similar items or concepts. Text Classification : Helping classify text into categories.

POS (Part of Speech) POS (Part of Speech) tagging is a process in natural language processing where words in a sentence are labeled with their corresponding parts of speech, such as nouns, verbs, adjectives, adverbs, etc. The goal is to understand the role of each word in a sentence to help computers analyze the meaning.

Here’s an easy explanation: Nouns (N) : Words that name people, places, things, or ideas. Example: dog , city , happiness Verbs (V) : Words that show actions, occurrences, or states of being. Example: run , is , seem Adjectives (ADJ) : Words that describe or modify nouns. Example: happy , blue , tall Adverbs (ADV) : Words that describe or modify verbs, adjectives, or other adverbs. Example: quickly , very , gently Pronouns (PRON) : Words that take the place of nouns. Example: he , she , they Prepositions (PREP) : Words that show the relationship between a noun (or pronoun) and other parts of the sentence. Example: in , on , under Conjunctions (CONJ) : Words that connect words, phrases, or clauses. Example: and , but , or Interjections (INTJ) : Words that show strong emotions or feelings. Example: wow , ouch , hey

For example, in the sentence "The quick brown fox jumps over the lazy dog," the POS tags would be: "The" – Determiner (often classified as a type of noun modifier) "quick" – Adjective "brown" – Adjective "fox" – Noun "jumps" – Verb "over" – Preposition "the" – Determiner "lazy" – Adjective "dog" – Noun By tagging these parts of speech, computers can better understand the structure of a sentence and its meaning.
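
In practice, POS tagging is usually done with a library. Below is a minimal sketch with NLTK, assuming NLTK is installed and its tokenizer and tagger resources can be downloaded (exact resource names vary slightly between NLTK versions):

```python
import nltk

# One-time downloads of the tokenizer and tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Returns (word, tag) pairs using Penn Treebank tags,
# e.g. ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('lazy', 'JJ')
```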

Methods of POS Tagging Rule-Based : Used for specific, structured text where rules are predefined (e.g., legal or scientific documents). Statistical : Works well for large datasets and can predict POS tags based on observed word patterns (e.g., news articles, blogs). Machine Learning : Suitable for dynamic contexts where patterns emerge from examples, used in translation and grammar-checking tools. Deep Learning : Best for handling complex, ambiguous sentences and long-term context, used in voice assistants, search engines, and advanced chatbots .

Count Vectorizer: A Count Vectorizer is a tool used in natural language processing (NLP) to transform a collection of text into a matrix of token counts. Here's a simple breakdown of how it works: 1. Text Input: Imagine you have a few documents (or sentences). For example: "I love programming." "Programming is fun." "I love learning new things." 2. Tokenization: The Count Vectorizer splits each sentence into individual words (tokens). For the above sentences, the tokens would be: "I", "love", "programming"; "Programming", "is", "fun"; "I", "love", "learning", "new", "things"

Cont.. 3. Vocabulary Creation: It then creates a list of all unique words across the documents (the vocabulary) and assigns each one an index: "I", "love", "programming", "is", "fun", "learning", "new", "things". 4. Count Matrix: For each sentence, it counts how many times each word from the vocabulary appears. This is represented as a matrix (or table) where: each row represents a sentence, each column represents a word from the vocabulary, and the cell values show the count of each word in that sentence.
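
A short sketch with scikit-learn's CountVectorizer (recent scikit-learn assumed). Note that the default tokenizer lowercases text, drops single-character tokens, and sorts the vocabulary alphabetically, so "I" disappears and the learned vocabulary differs slightly from the hand-built list above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love programming.",
    "Programming is fun.",
    "I love learning new things.",
]

vectorizer = CountVectorizer()           # default: lowercase, ignore 1-character tokens like "I"
counts = vectorizer.fit_transform(docs)  # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['fun' 'is' 'learning' 'love' 'new' 'programming' 'things']
print(counts.toarray())
# [[0 0 0 1 0 1 0]
#  [1 1 0 0 0 1 0]
#  [0 0 1 1 1 0 1]]
```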

Simple N-gram models N-gram models are a basic type of statistical language model used in natural language processing (NLP) to predict the likelihood of a sequence of words.

What is an N-gram? An N-gram is a contiguous sequence of N items (typically words or characters) from a given sample of text or speech. Unigram: a single word (e.g., "I", "like", "cats"). Bigram: two consecutive words (e.g., "I like", "like cats"). Trigram: three consecutive words (e.g., "I like cats"). N-gram: the general term for N consecutive words (e.g., "I really like cats" is a 4-gram, or "quadgram").

How do N-gram models work? Count Frequencies: An N-gram model starts by counting how often specific N-grams appear in a large corpus of text. Example: for bigrams, calculate how often "I like", "like cats", etc., occur. Probability Estimation: The model estimates the probability of the next word based on the previous N−1 words. For a bigram model: P(cats | like) = Count("like cats") / Count("like"). For a trigram model: P(cats | I like) = Count("I like cats") / Count("I like"). Predict Next Word: Using these probabilities, the model predicts the most likely next word in a sequence.

Example 1: Bigram Model. Let's say we have the following text: "I like cats. I like dogs. I like ice cream." We count the bigrams (two-word sequences): "I like" → 3, "like cats" → 1, "like dogs" → 1, "like ice" → 1.

Cont.. Now, we calculate the probability of "cats" given "like": P("cats" | "like") = Count("like cats") / Count("like") = 1/3 ≈ 0.33. Similarly: P("dogs" | "like") = 1/3 ≈ 0.33 and P("ice" | "like") = 1/3 ≈ 0.33. So, if we see "I like", the model predicts "cats", "dogs", or "ice" with equal probability.

Predicting Next Word Input: "I like" Bigram Probabilities: "cats" → 33% "dogs" → 33% "ice" → 33% Predicted Next Word: The model randomly selects from cats, dogs, or ice .
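
The whole bigram example can be reproduced with a few lines of plain Python; this sketch counts unigrams and bigrams in the toy corpus and computes P(next word | "like") exactly as above.

```python
from collections import Counter

corpus = ["I like cats", "I like dogs", "I like ice cream"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))   # consecutive word pairs

def bigram_prob(word, prev):
    # P(word | prev) = Count(prev word) / Count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_counts[("I", "like")])   # 3
print(bigram_prob("cats", "like"))    # 0.333...
print(bigram_prob("dogs", "like"))    # 0.333...
print(bigram_prob("ice", "like"))     # 0.333...
```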

Bag of words What is Bag of Words ( BoW )? Bag of Words is a simple way to convert text into numbers ( vectorization ) so that a computer can understand and process it. It does this by counting how often each word appears in a document.

Example to Understand BoW Easily. Step 1: Sample Sentences. Imagine we have two simple sentences: 1. "I love apples and oranges." 2. "Apples and bananas are tasty." Step 2: Create a Vocabulary. We list all unique words from both sentences: ["I", "love", "apples", "and", "oranges", "bananas", "are", "tasty"]

Step 3: Count Word Occurrences Now, we create a table to count how many times each word appears in each sentence.

Cont.. Each row represents a word, and each column shows how many times that word appears in a sentence. Step 4: Represent as a Vector Each sentence is now converted into a numerical vector: Sentence 1 → [1, 1, 1, 1, 1, 0, 0, 0] Sentence 2 → [0, 0, 1, 1, 0, 1, 1, 1] Now, the computer can process these numbers instead of raw text!
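
A tiny sketch (plain Python, keeping the slide's vocabulary order and ignoring case) that builds exactly these two vectors:

```python
# Vocabulary in the order used on the slide
vocab = ["i", "love", "apples", "and", "oranges", "bananas", "are", "tasty"]

def bag_of_words(sentence):
    tokens = sentence.lower().replace(".", "").split()
    return [tokens.count(word) for word in vocab]   # count of each vocabulary word

print(bag_of_words("I love apples and oranges"))     # [1, 1, 1, 1, 1, 0, 0, 0]
print(bag_of_words("Apples and bananas are tasty"))  # [0, 0, 1, 1, 0, 1, 1, 1]
```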

Cont.. 3. Advantages & Disadvantages. ✔ Advantages: Works well for simple text classification tasks. Easy to implement and interpret. ❌ Disadvantages: Doesn't consider the order or meaning of words (e.g., "not good" and "good" are treated similarly). A large vocabulary can make computation expensive. Can lead to sparse matrices, meaning lots of zeros in the vector representation.

Cont.. 4. How to Improve BoW? To make BoW more effective, we can use: TF-IDF (Term Frequency-Inverse Document Frequency) – gives importance to rare words instead of common ones. N-grams – considers sequences of words instead of single words (e.g., "Machine Learning" as one unit). Word Embeddings (like Word2Vec, GloVe) – capture the meaning and context of words.

TF-IDF (Term Frequency-Inverse Document Frequency) TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in Natural Language Processing (NLP) and information retrieval to evaluate the importance of a word within a document relative to a corpus of documents.

Components of TF-IDF. 1. Term Frequency (TF): how often a word appears in a document, normalized by the document length: TF(w) = (number of times w appears in the document) / (total number of words in the document)

2. Inverse Document Frequency (IDF): how rare a word is across the whole corpus: IDF(w) = log(total number of documents / number of documents containing w) (log base 10 is used in the examples below)

TF-IDF Score. What it is: It combines TF and IDF to give a score to each word that reflects its importance in a document relative to the entire corpus. Formula: TF-IDF(w) = TF(w) × IDF(w). Why it matters: Words that are frequent in a document but rare across all documents will have a high TF-IDF score, indicating they are important. Example: Suppose TF("apple") = 0.05 and IDF("apple") = 2; then TF-IDF("apple") = 0.05 × 2 = 0.1

Why is TF-IDF useful? Words that are very common across many documents (like "the," "is," "and") will have a low IDF value and will not contribute much to the TF-IDF score. Words that are frequent in a specific document but rare in others will have a high TF-IDF score, meaning they are important keywords for that document.

New Documents: Document 1: "The sun is bright." Document 2: "The moon is bright at night." Document 3: "The stars are bright in the sky." Step 1: Calculate Term Frequency (TF). We'll calculate the TF for the words "sun," "moon," "bright," "night," "stars," "sky," and "the" in each document. Document 1: "The sun is bright." Total words = 4 ("The", "sun", "is", "bright"). "The" appears 1 time, "sun" appears 1 time, "bright" appears 1 time. TF(the) = 1/4 = 0.25, TF(sun) = 1/4 = 0.25, TF(bright) = 1/4 = 0.25

Document 2: "The moon is bright at night." Total words = 6 ("The", "moon", "is", "bright", "at", "night") The appears 1 time, Moon appears 1 time, Bright appears 1 time, Night appears 1 time. TF(the)=1/6≈0.1667 TF(moon)=1/6≈0.1667 TF(bright)=1/6≈0.1667 TF(night)=1/6≈0.1667 Document 3: "The stars are bright in the sky." Total words = 7 ("The", "stars", "are", "bright", "in", "the", "sky") The appears 2 times, Stars appears 1 time, Bright appears 1 time, Sky appears 1 time. TF(the)=2/7 ≈0.2857 TF(stars)=1/7≈0.1429 TF(bright)=1/7≈0.1429 TF(sky)=1/7≈0.1429

Step 2: Calculate Inverse Document Frequency (IDF): Now, we calculate the IDF for the words "sun," "moon," "bright," "night," "stars," and "sky." Total number of documents = 3. Now, count how many documents contain each word: Sun appears in Document 1 (1 document). Moon appears in Document 2 (1 document). Bright appears in all 3 documents (3 documents). Night appears in Document 2 (1 document). Stars appears in Document 3 (1 document). Sky appears in Document 3 (1 document). The appears in all 3 documents (3 documents).

Cont.. We can now calculate the IDF for each word (using log base 10): IDF(sun) = log(3/1) = log(3) ≈ 0.4771, IDF(moon) = log(3/1) ≈ 0.4771, IDF(bright) = log(3/3) = log(1) = 0, IDF(night) = log(3/1) ≈ 0.4771, IDF(stars) = log(3/1) ≈ 0.4771, IDF(sky) = log(3/1) ≈ 0.4771, IDF(the) = log(3/3) = log(1) = 0

Step 3: Calculate TF-IDF. Now, let's calculate the TF-IDF for each word in each document. Document 1: "The sun is bright." TF-IDF(the, Doc 1) = 0.25 × 0 = 0; TF-IDF(sun, Doc 1) = 0.25 × 0.4771 = 0.1193; TF-IDF(bright, Doc 1) = 0.25 × 0 = 0. Document 2: "The moon is bright at night." TF-IDF(the, Doc 2) = 0.1667 × 0 = 0; TF-IDF(moon, Doc 2) = 0.1667 × 0.4771 = 0.0795; TF-IDF(bright, Doc 2) = 0.1667 × 0 = 0; TF-IDF(night, Doc 2) = 0.1667 × 0.4771 = 0.0795. Document 3: "The stars are bright in the sky." TF-IDF(the, Doc 3) = 0.2857 × 0 = 0; TF-IDF(stars, Doc 3) = 0.1429 × 0.4771 ≈ 0.0682; TF-IDF(bright, Doc 3) = 0.1429 × 0 = 0; TF-IDF(sky, Doc 3) = 0.1429 × 0.4771 ≈ 0.0682
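
The worked example above can be checked with a short script. This sketch follows the slide's formulas directly and uses log base 10 (so log(3) ≈ 0.4771); note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of IDF and would give different numbers.

```python
import math

docs = [
    "the sun is bright",
    "the moon is bright at night",
    "the stars are bright in the sky",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(word, doc_tokens):
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word):
    docs_with_word = sum(1 for doc in tokenized if word in doc)
    return math.log10(N / docs_with_word)   # log base 10, as in the slides

def tf_idf(word, doc_tokens):
    return tf(word, doc_tokens) * idf(word)

print(round(tf_idf("sun", tokenized[0]), 4))     # 0.1193
print(round(tf_idf("bright", tokenized[0]), 4))  # 0.0 (appears in every document)
print(round(tf_idf("night", tokenized[1]), 4))   # 0.0795
print(round(tf_idf("sky", tokenized[2]), 4))     # 0.0682
```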

Applications Ranking documents in search engines Classifying documents into categories Identifying keywords Detecting plagiarism

Multiword Expressions (MWEs) Multiword Expressions (MWEs) refer to groups of words that together form a single unit of meaning, but their meaning is different from the individual words in the expression. These expressions often cannot be understood by simply looking at the meaning of each word separately. Examples: "kick the bucket" – This means "to die," but if you take the words literally, they have no connection to death. "break the ice" – This means to start a conversation or make people feel more comfortable, not actually breaking ice. "by and large" – This means "generally speaking," not referring to size or scale.

What Are MWEs? Fixed phrases : These are expressions that have a fixed structure and meaning. They often don’t follow regular grammar rules. Example: "kick the bucket" (which means "to die," not literally kicking a bucket). Collocations : These are word pairs or groups of words that often appear together, and their meaning is understood by frequent use in the language. Example: "fast food" (fast and food are commonly used together, but "fast" doesn’t describe food directly). Idiomatic expressions : These are phrases whose meanings are not derived from the meanings of the individual words. Example: "break a leg" (meaning "good luck"). Phrasal verbs : These are verbs combined with prepositions or adverbs to create a new meaning. Example: "give up" (meaning "quit").

Why Are MWEs Important in NLP? Ambiguity resolution : In regular language processing, breaking a sentence into individual words can lead to confusion, but MWEs help reduce this ambiguity. Improving understanding : MWEs help machines better understand human language, as they reflect the actual usage of words in context.

Examples: "New York" : It’s not just the words “new” and “ york ,” but a location (a proper noun). "Take a break" : The meaning is not about physically taking a break, but about resting. Types of MWEs: Fixed Expressions : "once in a blue moon" Phrasal Verbs : "look up" Collocations : "strong tea" By recognizing MWEs, NLP systems can better understand language and handle tasks like translation, sentiment analysis, and question answering more effectively!

Applications: Machine Translation: correct translation of phrases like "kick the bucket" (to die) by treating MWEs as a whole unit. Speech Recognition: recognize phrases like "good morning" without misinterpreting individual words. Sentiment Analysis: detect the true sentiment in phrases like "break a leg" (good luck), not just the words "break" and "leg." Named Entity Recognition (NER): identify multiword entities like "United Nations" as a single entity. Question Answering (QA): interpret phrases like "capital of France" correctly in a question. Social Media Analysis: understand hashtags or informal phrases (#BlackFriday) in trend detection.

Role of Language Models What is a Language Model? A language model is a system designed to predict and understand the structure and meaning of human language. It uses patterns from large amounts of text to guess what words or sentences are most likely to come next in a given context.

How Does It Work? Language models are trained using tons of text, such as books, articles, and websites. They learn patterns like: Word prediction : For example, if the sentence is "The cat is on the ___," the model can predict the word "mat." Context understanding : It can also understand the meaning of a word based on the words around it. For example, the model knows that "bank" could mean a place to store money or the side of a river, depending on the surrounding words.

Why Are Language Models Important in NLP? Language models are the backbone of many NLP tasks, such as: Text Generation : They help generate new text that sounds natural (e.g., writing essays, creating chatbot responses). Translation : Translating text from one language to another (e.g., Google Translate). Sentiment Analysis : Determining if a piece of text is positive, negative, or neutral (e.g., customer reviews). Speech Recognition : Converting spoken language into written text (e.g., Siri , Google Assistant).

4. Types of Language Models: Statistical Models: earlier models relied on counting word frequencies and probabilities. Neural Networks: modern models, like GPT (Generative Pre-trained Transformer), use deep learning to understand and generate language more effectively. 5. Real-Life Examples: Chatbots: like the one you're interacting with, where the model predicts the best response based on your input. Voice Assistants: Alexa, Siri, and Google Assistant use language models to understand commands and answer questions. Text Autocompletion: when typing a message, predictive text suggests the next word or phrase based on your previous words.

Estimating Parameters and Smoothing. Estimating Parameters (Finding Probabilities): When we try to predict the next word in a sentence, we need to know how often words appear together. This is called estimating probabilities. For example, in a bigram model (where we predict one word based on the previous one), we calculate: P(word | previous word) = Count(previous word, word) / Count(previous word)

Example

What is Smoothing? Smoothing is a trick to handle words or phrases that we haven't seen before in our data. Imagine you have a predictive text keyboard. You've seen the phrase "I love chocolate cake" many times, but you've never seen "I love chocolate pizza" in your data. Without smoothing, the keyboard will think "chocolate pizza" is impossible and never suggest it! Smoothing fixes this: it makes sure everything gets at least a small probability, even if we haven't seen it before.

How Does Smoothing Work? Smoothing adds a small value to every count, so nothing is completely zero. Example: Add-One (Laplace) Smoothing — we just add 1 to every count! 🎉 Without smoothing: if your data has "chocolate cake" appearing 5 times and "chocolate pizza" appearing 0 times, then P("pizza" | "chocolate") = 0 / (total times "chocolate" appears) = 0, so the model thinks "chocolate pizza" is impossible (which is not true). With add-one smoothing, we compute P("pizza" | "chocolate") = (0 + 1) / (Count("chocolate") + V), where V is the vocabulary size, so the probability is small but no longer zero.
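
A small sketch of add-one smoothing for this bigram case; the counts and the vocabulary size below are made-up numbers for illustration only.

```python
def laplace_bigram_prob(bigram_count, prev_count, vocab_size):
    # Add-one (Laplace) smoothing: every bigram gets at least a small probability
    return (bigram_count + 1) / (prev_count + vocab_size)

vocab_size = 1000              # assumed vocabulary size
count_chocolate = 100          # assumed: "chocolate" seen 100 times
count_chocolate_cake = 5       # "chocolate cake" seen 5 times
count_chocolate_pizza = 0      # "chocolate pizza" never seen

print(laplace_bigram_prob(count_chocolate_cake, count_chocolate, vocab_size))   # ≈ 0.0055
print(laplace_bigram_prob(count_chocolate_pizza, count_chocolate, vocab_size))  # ≈ 0.0009, small but not zero
```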

Why Does Smoothing Matter? Helps chatbots, voice assistants, and search engines handle new words. Improves spell check, predictive text, and auto-correct. Makes translations and speech recognition more accurate. Ensures NLP models don't break when they see something new. Without smoothing, models would fail at handling rare or unseen words. With smoothing, they stay smart and flexible!