Unigram_Bigram_Spam_Detection limitation.pptx

DrNSumathiN 1 views 8 slides Oct 23, 2025


About This Presentation

Unigram and bigram models are n-gram language models used in Natural Language Processing (NLP) to represent text statistically: a unigram model counts individual words, while a bigram model counts pairs of consecutive words.

Size: 48.75 KB

Language: en

Added: Oct 23, 2025

Slides: 8 pages

Slide Content

Unigram and Bigram Models in Spam Detection - Gen AI
Dr. N. Sumathi, Sri Ramakrishna College of Arts & Science

Unigram Model – Simplest Approach
• Counts the frequency of each individual word (unigram) in an email.
• Creates a feature vector with one entry per word in the vocabulary.
• Example: If the vocabulary has 100,000 words, the feature vector has 100,000 entries.
• Most entries are zero because each email contains only a small fraction of the vocabulary.
→ Stored densely, this is not space efficient, and it does not consider word order.
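The unigram feature vector described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the eight-word `vocabulary` and the helper names `tokenize` and `unigram_vector` are assumptions made for the example, and a real vocabulary would be far larger.

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; real tokenizers also handle punctuation.
    return text.lower().split()

def unigram_vector(text, vocabulary):
    # One entry per vocabulary word, holding that word's count in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary; a real one could have ~100,000 entries.
vocabulary = ["win", "a", "free", "offer", "meeting", "tomorrow", "invoice", "click"]
vector = unigram_vector("Win a free offer", vocabulary)
print(vector)  # [1, 1, 1, 1, 0, 0, 0, 0]

# A sparse representation keeps only the nonzero counts.
sparse = {word: n for word, n in zip(vocabulary, vector) if n}
print(sparse)  # {'win': 1, 'a': 1, 'free': 1, 'offer': 1}
```

Even in this toy case half the entries are zero, which is why implementations usually store such vectors sparsely.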

Limitations of Unigram Model
• Ignores word order, so meaning and context are lost.
• Example: 'Win a free offer' and 'Offer a free win' contain the same words and are treated as identical.
• Fails to capture relationships between words.
• Hence, not suitable for spam detection on its own.
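A quick check of the slide's example, assuming simple lowercase whitespace tokenization (the helper name `unigram_counts` is illustrative): the two phrases produce identical word counts even though their word sequences differ.

```python
from collections import Counter

def unigram_counts(text):
    # Word order is discarded: only how often each word occurs is kept.
    return Counter(text.lower().split())

first = "Win a free offer"
second = "Offer a free win"

same_unigrams = unigram_counts(first) == unigram_counts(second)
same_order = first.lower().split() == second.lower().split()

print(same_unigrams)  # True
print(same_order)     # False
```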

Bigram Model
• Considers pairs of consecutive words (bigrams).
• Captures simple word relationships and local context.
• Number of possible features = (number of words)², which is very large.
• Example: For 100,000 words there are 100,000² = 10 billion possible bigrams.
→ Most entries are still zero, but the nonzero ones carry more meaningful information.
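As a rough sketch of the bigram idea, the snippet below extracts consecutive word pairs and reproduces the feature-space arithmetic from the slide. The function name `bigram_counts` and the whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def bigram_counts(text):
    # Pair each word with its successor: (w1, w2), (w2, w3), ...
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

# Size of the full bigram feature space for a 100,000-word vocabulary.
vocab_size = 100_000
possible_bigrams = vocab_size ** 2
print(f"{possible_bigrams:,}")  # 10,000,000,000

counts = bigram_counts("Win a free offer")
print(sorted(counts))
# [('a', 'free'), ('free', 'offer'), ('win', 'a')]

# Unlike unigram counts, bigram counts distinguish the reordered phrase.
print(counts == bigram_counts("Offer a free win"))  # False
```

A four-word email yields only three bigrams out of the 10 billion possible ones, which is why the vector remains almost entirely zeros.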

Improving Bigram Models
To reduce the feature space:
- Remove very common bigrams (e.g., 'of the', 'in the').
- Remove nonsensical bigrams that never appear and rare bigrams that appear too seldom to be informative.
• This keeps only meaningful, informative bigrams.
• Improves accuracy and reduces processing time.
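One possible way to implement this pruning is to filter bigrams by document frequency, that is, by the number of emails each bigram occurs in. The slides do not specify a method, so the thresholds `min_docs` and `max_doc_fraction`, the function names, and the five sample emails below are all assumptions chosen for illustration.

```python
from collections import Counter

def bigrams(text):
    # Set of distinct consecutive word pairs in one email.
    words = text.lower().split()
    return set(zip(words, words[1:]))

def select_bigrams(emails, min_docs=2, max_doc_fraction=0.8):
    # Document frequency: in how many emails each bigram occurs.
    doc_freq = Counter()
    for email in emails:
        doc_freq.update(bigrams(email))
    max_docs = max_doc_fraction * len(emails)
    # Drop bigrams that are too rare or too common to be informative.
    return {bg for bg, n in doc_freq.items() if min_docs <= n <= max_docs}

# Hypothetical sample emails.
emails = [
    "claim the free offer in the mail",
    "the free offer ends in the morning",
    "see you in the office",
    "lunch in the cafeteria today",
    "notes in the shared folder",
]
kept = select_bigrams(emails)
print(sorted(kept))  # [('free', 'offer'), ('the', 'free')]
```

Here 'in the' occurs in every email and is dropped as too common, while pairs seen in only one email are dropped as too rare. Bigrams that never appear in the training emails are never counted, so they need no explicit removal.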

Other Important Features Used
• Sender email address.
• Time of email delivery.
• Subject line content.
• Recipient preferences and past history.
• Frequency of messages from a server.
→ The system updates regularly to adapt to new spam patterns.
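To make the list above concrete, here is a hedged sketch of how such non-text signals might be turned into features. Every field name, threshold, and encoding in it (for example, treating delivery before 06:00 as "night", or using a contact-list flag as a stand-in for recipient history) is a hypothetical choice for illustration, not something the slides prescribe.

```python
from dataclasses import dataclass

@dataclass
class EmailMeta:
    sender: str                 # sender email address
    hour_sent: int              # delivery hour, 0-23
    subject: str                # subject line content
    sender_in_contacts: bool    # stand-in for recipient history
    server_msgs_last_hour: int  # message frequency from the sending server

def metadata_features(meta):
    # Encode each signal as a simple value a classifier could consume.
    domain = meta.sender.rsplit("@", 1)[-1].lower()
    return {
        "sender_domain": domain,
        "sent_at_night": int(meta.hour_sent < 6),
        "subject_has_free": int("free" in meta.subject.lower().split()),
        "known_sender": int(meta.sender_in_contacts),
        "server_msgs_last_hour": meta.server_msgs_last_hour,
    }

meta = EmailMeta("promo@Example.com", 3, "Win a FREE offer", False, 5000)
features = metadata_features(meta)
print(features["sender_domain"])  # example.com
print(features["sent_at_night"])  # 1
```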

Classification Techniques & Accuracy
• Common classifiers used:
- k-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Neural Networks
• Accuracy depends on the chosen features.
• Commercial systems achieve ≈99% accuracy using ~100 well-selected features.
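Of the classifiers listed, k-Nearest Neighbors is the simplest to sketch without a library. The version below is a minimal illustration using Euclidean distance and majority voting; the three-count feature layout and the six training examples are made up for the example and are far smaller than anything a real filter would use.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    # train: list of (feature_vector, label) pairs.
    # Take the k training points closest to the query and let them vote.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical features: [count of "free", count of "offer", count of "meeting"]
train = [
    ([2, 1, 0], "spam"),
    ([1, 2, 0], "spam"),
    ([3, 1, 0], "spam"),
    ([0, 0, 2], "ham"),
    ([0, 0, 1], "ham"),
    ([0, 1, 1], "ham"),
]

print(knn_predict(train, [2, 2, 0]))  # spam
print(knn_predict(train, [0, 0, 3]))  # ham
```

Because the prediction depends entirely on distances between feature vectors, the choice of features matters more than the classifier itself, which matches the slide's point that accuracy depends on the chosen features.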

Summary
• Unigram models are simple but lose word order and context.
• Bigram models capture relationships between adjacent words but greatly increase the feature space.
• Effective spam detection combines multiple features beyond word frequency.
• Regular updates ensure systems adapt to new spam tactics.
→ Modern spam filters reach around 99% accuracy.
Reference: Generative AI and LLMs for Dummies by David Baum
