Unigram_Bigram_Spam_Detection limitation.pptx

DrNSumathiN 1 views 8 slides Oct 23, 2025

About This Presentation

Unigram and Bigram models are types of n-gram language models used in Natural Language Processing (NLP) to represent text statistically by examining the sequence of words.


Slide Content

Unigram and Bigram Models in Spam Detection - Gen AI
Dr. N. Sumathi, Sri Ramakrishna College of Arts & Science

Unigram Model – Simplest Approach
• Counts the frequency of each individual word (unigram) in emails.
• Creates a feature vector with one entry per word in the vocabulary.
• Example: If the vocabulary has 100,000 words, the feature vector has 100,000 entries.
• Most entries are zero because each email contains only a few of those words.
→ Not space efficient when stored in full, and does not consider word order.
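
A minimal sketch of the unigram feature vector described above, assuming whitespace tokenization and a six-word toy vocabulary (both are illustrative choices, not part of the slides). It builds the full vector with one entry per vocabulary word, and a sparse form that stores only the words that occur.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()

def unigram_vector(text, vocabulary):
    """Full vector: one count per vocabulary word, mostly zeros in practice."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

def unigram_sparse(text):
    """Sparse form: keep only the words that actually occur in the email."""
    return dict(Counter(tokenize(text)))

# Hypothetical toy vocabulary; a real one could hold 100,000 words.
vocabulary = ["a", "free", "meeting", "offer", "today", "win"]
email = "Win a free offer"

dense = unigram_vector(email, vocabulary)
sparse = unigram_sparse(email)
```

With a realistic vocabulary the full vector is almost entirely zeros, which is why the sparse form is usually the more practical representation.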

Limitations of Unigram Model
• Ignores word order, so meaning and context are lost.
• Example: 'Win a free offer' and 'Offer a free win' produce identical feature vectors.
• Fails to capture relationships between words.
• Hence, not suitable for spam detection on its own.
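
The word-order limitation can be checked directly. This short sketch, again assuming simple whitespace tokenization, counts the words in the two example phrases and compares the results.

```python
from collections import Counter

def unigram_counts(text):
    """Word counts only; the positions of the words are discarded."""
    return Counter(text.lower().split())

first = unigram_counts("Win a free offer")
second = unigram_counts("Offer a free win")

# Both phrases contain the same words once each, so the counts match.
same_representation = first == second
```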

Bigram Model
• Considers pairs of consecutive words (bigrams).
• Captures simple word relationships and context.
• Number of possible features = (Number of words)² → very large.
• Example: For 100,000 words → 100,000² = 10 billion possible bigrams.
→ Still many zero entries, but captures more meaningful information.
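
As a rough illustration, bigrams can be extracted by pairing each word with the one that follows it. The sketch below assumes whitespace tokenization; it shows that the two phrases a unigram model cannot tell apart yield different bigrams, and it reproduces the feature-count arithmetic.

```python
def bigrams(text):
    """Return consecutive word pairs, e.g. ('free', 'offer')."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

first = bigrams("Win a free offer")
second = bigrams("Offer a free win")

# Size of the bigram feature space for a 100,000-word vocabulary.
vocab_size = 100_000
possible_bigrams = vocab_size ** 2
```

Here `first` and `second` share only the pair `('a', 'free')`, so the bigram representation distinguishes the two phrases.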

Improving Bigram Models
To reduce feature space:
- Remove very common bigrams (e.g., 'of the', 'in the').
- Remove nonsensical bigrams that never appear in real emails, and bigrams too rare to be informative.
• This keeps only meaningful, informative bigrams.
• This can improve accuracy and reduces processing time.
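
One possible way to prune the bigram feature space is to filter by document frequency, that is, by the number of emails in which a bigram occurs. The thresholds `min_df` and `max_df_ratio` and the five toy emails below are assumptions chosen for illustration; the slides do not specify a selection rule.

```python
from collections import Counter

def bigrams(text):
    """Return consecutive word pairs from whitespace-split text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def select_bigrams(emails, min_df=2, max_df_ratio=0.7):
    """Keep bigrams that are neither too rare nor too common.

    min_df: minimum number of emails a bigram must appear in (assumed).
    max_df_ratio: maximum fraction of emails it may appear in (assumed).
    """
    doc_freq = Counter()
    for email in emails:
        doc_freq.update(set(bigrams(email)))  # count each email once
    total = len(emails)
    return {
        bg for bg, n in doc_freq.items()
        if n >= min_df and n / total <= max_df_ratio
    }

# Hypothetical toy corpus.
emails = [
    "claim the free offer in the mail",
    "the free offer ends in the morning",
    "see you in the office",
    "lunch in the park",
    "free offer inside",
]
kept = select_bigrams(emails)
```

In this toy corpus 'in the' occurs in four of five emails and is dropped as too common, bigrams seen only once are dropped as too rare, and only 'free offer' and 'the free' remain.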

Other Important Features Used
• Sender email address.
• Time of email delivery.
• Subject line content.
• Recipient preferences and past history.
• Frequency of messages from a server.
→ The system updates regularly to adapt to new spam patterns.

Classification Techniques & Accuracy
• Common classifiers used:
- k-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Neural Networks
• Accuracy depends on chosen features.
• Commercial systems achieve ≈99% accuracy using ~100 well-selected features.
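
To make the classification step concrete, here is a small k-Nearest Neighbors sketch over combined unigram and bigram counts, using cosine similarity and a majority vote. The six training emails, the "spam"/"ham" labels, and the choice of k = 3 are all hypothetical; this is a toy illustration, not how a commercial filter is built, and it says nothing about achievable accuracy.

```python
import math
from collections import Counter

def features(text):
    """Counts of unigrams plus bigrams (bigrams joined with a space)."""
    tokens = text.lower().split()
    pairs = [" ".join(p) for p in zip(tokens, tokens[1:])]
    return Counter(tokens + pairs)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_predict(text, training, k=3):
    """Label the email by majority vote among its k most similar examples."""
    query = features(text)
    ranked = sorted(training,
                    key=lambda item: cosine(query, features(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labelled examples.
training = [
    ("win a free prize now", "spam"),
    ("free offer claim your prize", "spam"),
    ("limited offer win cash now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
    ("lunch on monday with the team", "ham"),
]

spam_guess = knn_predict("claim your free prize", training)
ham_guess = knn_predict("project report for monday meeting", training)
```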

Summary
• Unigram models are simple but lose word order and context.
• Bigram models capture relationships between words but increase feature size.
• Effective spam detection combines multiple features beyond word frequency.
• Regular updates ensure systems adapt to new spam tactics.
→ Modern spam filters reach around 99% accuracy.
Reference: Generative AI and LLMs for Dummies by David Baum