Unigram and bigram models are n-gram language models used in Natural Language Processing (NLP) to represent text statistically: a unigram model counts individual words, while a bigram model counts pairs of consecutive words.
Added: Oct 23, 2025
Unigram and Bigram Models in Spam Detection - Gen AI
Dr. N. Sumathi, Sri Ramakrishna College of Arts & Science
Unigram Model – Simplest Approach
• Counts the frequency of each individual word (unigram) in an email.
• Creates a feature vector with one entry per word in the language.
• Example: If a language has 100,000 words, the feature vector has 100,000 entries.
• Most entries are zero because each email contains only a few of those words.
→ Not space-efficient and does not consider word order.
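The unigram feature vector described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six-word `vocabulary`, the whitespace tokenizer, and the helper names `unigram_counts` and `dense_vector` are hypothetical and do not come from the slides.

```python
from collections import Counter

def unigram_counts(text):
    """Count how often each lowercase, whitespace-separated token occurs."""
    return Counter(text.lower().split())

def dense_vector(counts, vocabulary):
    """Expand sparse counts into one entry per vocabulary word."""
    return [counts.get(word, 0) for word in vocabulary]

# Hypothetical toy vocabulary; a real one could hold around 100,000 words,
# which is why most entries of the dense vector would be zero.
vocabulary = ["a", "free", "meeting", "offer", "today", "win"]
email = "Win a free offer"

counts = unigram_counts(email)
vector = dense_vector(counts, vocabulary)
print(vector)  # [1, 1, 0, 1, 0, 1]
```

Keeping only the non-zero counts (the `Counter` itself) is one common way to avoid storing the zero entries.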
Limitations of Unigram Model
• Ignores word order, so meaning and context are lost.
• Example: 'Win a free offer' and 'Offer a free win' are treated the same.
• Fails to capture relationships between words.
• Hence, not suitable for spam detection on its own.
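The word-order limitation can be checked directly. The sketch below assumes a simple lowercase, whitespace-based tokenizer (an assumption, not something the slides specify) and compares the unigram counts of the two example phrases.

```python
from collections import Counter

def unigram_counts(text):
    """Count lowercase, whitespace-separated tokens; order is discarded."""
    return Counter(text.lower().split())

first = unigram_counts("Win a free offer")
second = unigram_counts("Offer a free win")

# Both phrases contain the same four words once each,
# so a unigram model cannot tell them apart.
same_features = first == second
print(same_features)  # True
```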
Bigram Model
• Considers pairs of consecutive words (bigrams).
• Captures simple word relationships and context.
• Number of possible features = (Number of words)² → very large.
• Example: For 100,000 words → 10 billion possible bigrams!
→ Still many zero entries, but captures more meaningful information.
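A minimal sketch of bigram extraction follows, again assuming a lowercase, whitespace-based tokenizer (the helper name `bigram_counts` is hypothetical). It shows that the two phrases a unigram model confuses now get different features, and it reproduces the feature-space arithmetic from the slide.

```python
from collections import Counter

def bigram_counts(text):
    """Count pairs of consecutive lowercase tokens."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

first = bigram_counts("Win a free offer")
second = bigram_counts("Offer a free win")

# Only ("a", "free") occurs in both phrases; the other bigrams differ.
shared = set(first) & set(second)
print(first == second)  # False
print(shared)           # {('a', 'free')}

# Size of the bigram feature space for a 100,000-word vocabulary.
vocabulary_size = 100_000
possible_bigrams = vocabulary_size ** 2
print(possible_bigrams)  # 10000000000, i.e. 10 billion
```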
Improving Bigram Models
To reduce the feature space:
- Remove very common bigrams (e.g., 'of the', 'in the').
- Remove nonsensical bigrams that will never appear, and bigrams too rare to be useful.
• This keeps only meaningful, informative bigrams.
• Improves accuracy and reduces processing time.
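One way to approximate this pruning is to filter bigrams by document frequency, meaning the number of emails in which a bigram appears. The sketch below is an assumption about how the idea could be implemented, not the method from the slides: the thresholds `min_df` and `max_df_ratio`, the helper names, and the five sample emails are all hypothetical.

```python
from collections import Counter

def bigrams(text):
    """Return the consecutive token pairs of a lowercase, whitespace-split text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

def select_bigrams(emails, min_df=2, max_df_ratio=0.7):
    """Keep bigrams found in at least min_df emails
    but in no more than max_df_ratio of all emails."""
    doc_freq = Counter()
    for email in emails:
        doc_freq.update(set(bigrams(email)))  # count each bigram once per email
    limit = max_df_ratio * len(emails)
    return {bg for bg, df in doc_freq.items() if min_df <= df <= limit}

# Hypothetical sample emails.
emails = [
    "claim the free offer in the mail",
    "the free offer ends in the morning",
    "meet me in the office",
    "lunch in the park",
    "free offer inside",
]

kept = select_bigrams(emails)
print(sorted(kept))  # [('free', 'offer'), ('the', 'free')]
```

Here ('in', 'the') is dropped because it appears in four of the five emails, and bigrams seen in only one email are dropped as too rare.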
Other Important Features Used
• Sender email address.
• Time of email delivery.
• Subject line content.
• Recipient preferences and past history.
• Frequency of messages from a server.
→ The system updates regularly to adapt to new spam patterns.
Classification Techniques & Accuracy
• Common classifiers used:
  - k-Nearest Neighbors (KNN)
  - Support Vector Machines (SVM)
  - Neural Networks
• Accuracy depends on the chosen features.
• Commercial systems achieve ≈99% accuracy using ~100 well-selected features.
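To illustrate how such features feed a classifier, here is a toy k-Nearest Neighbors sketch over combined unigram and bigram counts. Everything in it is an assumption made for illustration: the six labeled training emails, the squared Euclidean distance, and the helper names are hypothetical, and a data set this small says nothing about real-world accuracy.

```python
from collections import Counter

def features(text):
    """Combine unigram and bigram counts into one sparse feature vector."""
    tokens = text.lower().split()
    return Counter(tokens) + Counter(zip(tokens, tokens[1:]))

def distance(a, b):
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a[key] - b[key]) ** 2 for key in set(a) | set(b))

def knn_predict(train, text, k=3):
    """Return the majority label among the k training emails closest to text."""
    query = features(text)
    nearest = sorted(train, key=lambda item: distance(features(item[0]), query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled training emails.
train = [
    ("win a free offer now", "spam"),
    ("free offer claim your prize", "spam"),
    ("win a prize today", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project meeting notes", "ham"),
]

print(knn_predict(train, "claim a free offer"))    # spam
print(knn_predict(train, "monday meeting notes"))  # ham
```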
Summary
• Unigram models are simple but lose word order and context.
• Bigram models capture relationships between words but increase feature size.
• Effective spam detection combines multiple features beyond word frequency.
• Regular updates ensure systems adapt to new spam tactics.
→ Modern spam filters reach around 99% accuracy.
Reference: Generative AI and LLMs for Dummies by David Baum