This unit introduces the fundamental concepts of Natural Language Processing (NLP), covering its history, applications, and challenges. It explains language components like phonology, morphology, syntax, semantics, and pragmatics. Students learn about different NLP approaches, rule-based, statistical, and machine learning and get an overview of NLP tools and resources. This unit builds the foundation for all subsequent topics in NLP.
MOSIUOA WESI – ANDHRA UNIVERSITY – VISAKHAPATNAM 530001
NLP - Unit I: Introduction and Language Modeling
1. Overview
Natural Language Processing (NLP) is the field that focuses on the interaction between
computers and human (natural) languages. It combines linguistics, computer science, and
artificial intelligence to enable machines to understand, interpret, generate, and respond to
human language in a valuable way.
2. Origins and Historical Context
• Early roots in symbolic AI and linguistics (1950s–1980s).
• Shift toward statistical methods in the 1990s with the availability of large corpora and
computing power.
• Recent surge from deep learning (2010s onward) enabling high-performance models for
many NLP tasks.
3. Main Challenges of NLP
• Ambiguity: Lexical (word sense ambiguity), syntactic (structural ambiguity), and semantic (multiple interpretations).
• Variability: Many ways to express the same meaning (paraphrase, synonyms, dialects).
• Context & World Knowledge: Language often requires real-world facts and pragmatics.
• Noisy Text: Typos, informal language, misspellings, and social media text.
• Resource Limitations: Low-resource languages lack annotated corpora and lexicons.
• Multilinguality & Domain Adaptation: Models trained on one domain or language may not generalize to others.
4. Language Modeling
A language model (LM) assigns probabilities to sequences of words. LMs are fundamental in
many NLP applications (speech recognition, machine translation, spell checking, text
generation).
Two broad categories:
• Grammar-based Language Models: Use linguistically motivated rules and grammars (e.g., context-free grammars) to define allowable sentences. Good for interpretability and formal syntactic constraints, but brittle and dependent on handcrafted rules.
• Statistical Language Models: Learn probabilities from corpora. The classic example is the n-gram model, which estimates P(w_1 ... w_n) using the chain rule and Markov assumptions.
Key ideas for statistical LMs:
• Chain rule: P(w_1 ... w_n) = Π_i P(w_i | w_1 ... w_{i-1}).
• n-gram approximation: P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-(n-1)} ... w_{i-1}) (e.g., bigrams, trigrams).
• Parameter estimation uses counts from corpora (maximum likelihood estimation).
• Smoothing techniques (to handle zero counts) include add-one, Good-Turing, backoff, and interpolation.
• Evaluation: perplexity is commonly used to measure how well a model predicts a held-out set.
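A minimal Python sketch of these ideas (not from the slides; the toy two-sentence corpus and the <s>/</s> boundary markers are assumptions made for the example). It estimates add-one-smoothed bigram probabilities from counts and scores a held-out sentence by perplexity:

    import math
    from collections import Counter

    # Toy corpus with sentence-boundary markers (illustrative only).
    corpus = [["<s>", "i", "like", "nlp", "</s>"],
              ["<s>", "i", "like", "deep", "learning", "</s>"]]

    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter((sent[i], sent[i + 1])
                      for sent in corpus for i in range(len(sent) - 1))
    vocab_size = len(unigrams)

    def p_bigram(prev, word):
        # Add-one (Laplace) smoothing: (count(prev, word) + 1) / (count(prev) + V)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

    def perplexity(sentence):
        # Perplexity = exp(-(1/N) * sum of log P(w_i | w_{i-1})) over the sentence's bigrams
        log_prob = sum(math.log(p_bigram(sentence[i], sentence[i + 1]))
                       for i in range(len(sentence) - 1))
        return math.exp(-log_prob / (len(sentence) - 1))

    print(p_bigram("i", "like"))                            # smoothed estimate of P(like | i)
    print(perplexity(["<s>", "i", "like", "nlp", "</s>"]))  # lower is better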
5. Regular Expressions (Regex) & Finite-State Automata (FSA)
Regular expressions are patterns used to match text. They are widely used for tokenization,
simple pattern extraction (emails, URLs), and quick text normalization.
• Example regex: \b\w+\b matches a simple word token (a word boundary followed by one or more word characters).
• More advanced patterns handle contractions (don't → do n't, or kept as don't, depending on the tokenization policy).
Finite-State Automata (FSA) are computational structures (DFA/NFA) that recognize
regular languages. They are closely related to regex and are used to implement fast
tokenizers, lexicons, and morphological analyzers.
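A short Python sketch showing the \b\w+\b pattern in use, alongside a deliberately simplified e-mail pattern (the sample text and the e-mail regex are illustrative assumptions; production-grade e-mail and URL patterns are far more involved):

    import re

    text = "Contact us at info@example.com, or don't hesitate to call."

    # \b\w+\b: word boundary + one or more word characters.
    # Note how it splits "don't" into 'don' and 't' (see the tokenization section).
    print(re.findall(r"\b\w+\b", text))

    # Simplified e-mail pattern for quick extraction (illustrative, not robust).
    print(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text))   # ['info@example.com']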
6. English Morphology & Transducers for Lexicon and Rules
Morphology studies the internal structure of words. Key elements:
• Morpheme: smallest meaningful unit (root/stem, prefixes, suffixes).
• Inflectional morphology: changes to express grammatical features (e.g., walk → walks, walk → walked).
• Derivational morphology: creates new words and may change the part of speech (e.g., govern → government).
Finite-State Transducers (FSTs) are automata that map between two sets of symbols (often
lexical ↔ surface forms). They are powerful for modeling morphology because they can
compactly encode alternations, affixation rules, and orthographic changes.
Typical applications:
• Morphological analysis (breaking 'running' → run + -ing).
• Morphological generation (building surface forms from a lemma + features).
• Spelling normalization and handling of irregular forms.
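The following Python sketch is not a real FST, but it mimics what an FST-based analyzer encodes: a lexicon, a table of (surface suffix, feature) pairs, and one orthographic rule for undoing consonant doubling. The lexicon, feature names, and rule set are toy assumptions for illustration:

    # Toy lexicon and suffix table (illustrative assumptions, not a standard resource).
    LEXICON = {"run", "walk", "govern", "stop"}
    SUFFIXES = (("ing", "+PROG"), ("ed", "+PAST"), ("s", "+3SG"))

    def analyze(surface):
        """Map a surface form to lemma + feature, applying one orthographic rule."""
        for suffix, feature in SUFFIXES:
            if surface.endswith(suffix):
                stem = surface[: -len(suffix)]
                # Orthographic rule: undo consonant doubling (run + ing -> running).
                if len(stem) >= 2 and stem[-1] == stem[-2] and stem[:-1] in LEXICON:
                    return stem[:-1] + " " + feature
                if stem in LEXICON:
                    return stem + " " + feature
        return surface + " +BASE" if surface in LEXICON else None

    print(analyze("running"))   # run +PROG
    print(analyze("walked"))    # walk +PAST
    print(analyze("governs"))   # govern +3SG

A real implementation would use an FST toolkit, since the same transducer can then be run in both directions (analysis and generation).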
7. Tokenization
Tokenization is the process of splitting a stream of text into tokens (words, punctuation,
numbers). Although it sounds trivial, correct tokenization is crucial and language-
dependent.
• Simple whitespace tokenization fails for languages without spaces (e.g., Chinese) and for handling punctuation and contractions.
• Decisions to make: how to handle punctuation (keep or strip), contractions (split or keep), hyphenation, URLs, emoticons, and numbers.
• Tools often use rule-based tokenizers, regex-based tokenizers, or statistical tokenizers trained on annotated corpora.
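A regex-based tokenizer sketch in Python illustrating some of these decisions (the pattern set and the "do n't"-style contraction splitting are illustrative policy choices, not a standard):

    import re

    TOKEN_RE = re.compile(r"""
        https?://\S+            # URLs kept as single tokens
      | \w+(?=n't)              # stem of a negative contraction ("do" in "don't")
      | n't                     # the split-off clitic
      | \w+'\w+                 # other contractions kept whole, e.g. "it's"
      | \w+                     # ordinary words and numbers
      | [^\w\s]                 # any single punctuation mark
    """, re.VERBOSE)

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("Don't stop! See https://example.org, it's great."))
    # ['Do', "n't", 'stop', '!', 'See', 'https://example.org,', "it's", 'great', '.']
    # Note the comma still attached to the URL: even small policies like
    # "keep URLs whole" interact with punctuation handling.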
8. Detecting and Correcting Spelling Errors
Two common error categories:
• Non-word errors: token not in the dictionary (e.g., 'teh' instead of 'the').
• Real-word errors: a valid word used incorrectly (e.g., 'their' vs. 'there').
Approaches to correction:
• Dictionary lookup & candidate generation (via edit distance or phonetic similarity).
• Noisy Channel Model: choose the correction c that maximizes P(c)P(w|c), where P(c) is a language model (prior) and P(w|c) is the error model (the probability that c becomes the observed w).
• Ranking candidates using language model scores and error probabilities.
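A minimal noisy-channel sketch in Python (the word counts are toy figures and the error model is held uniform, both assumptions for illustration; a real system would estimate P(w|c) from spelling-error data and P(c) from a large corpus):

    # Toy unigram counts standing in for the language-model prior P(c).
    WORD_COUNTS = {"the": 500, "then": 120, "ten": 40, "tea": 30}
    TOTAL = sum(WORD_COUNTS.values())

    def edits1(word):
        """All strings one edit away: deletions, transpositions, substitutions, insertions."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        substitutions = [a + c + b[1:] for a, b in splits if b for c in letters]
        inserts = [a + c + b for a, b in splits for c in letters]
        return set(deletes + transposes + substitutions + inserts)

    def correct(w):
        # Candidates = known words within one edit; score = P(c) * P(w|c),
        # with P(w|c) treated as constant (uniform error model) for simplicity.
        candidates = [c for c in edits1(w) if c in WORD_COUNTS] or [w]
        return max(candidates, key=lambda c: WORD_COUNTS.get(c, 0) / TOTAL)

    print(correct("teh"))   # 'the': reachable by one transposition, highest prior P(c)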
9. Minimum Edit Distance (Levenshtein Distance)
Minimum edit distance quantifies how many insertions, deletions, or substitutions are
required to transform one string into another. The Levenshtein distance uses unit cost for
each of these operations and is computed with dynamic programming.
Example: the classic 'kitten' → 'sitting' case.
The minimum edit distance between 'kitten' and 'sitting' is 3, via the edit sequence kitten → sitten (substitute k→s) → sittin (substitute e→i) → sitting (insert g).
DP matrix (rows = prefixes of source 'kitten', columns = prefixes of target 'sitting'; the empty prefix is shown as ""):

          ""  s   i   t   t   i   n   g
      ""  0   1   2   3   4   5   6   7
      k   1   1   2   3   4   5   6   7
      i   2   2   1   2   3   4   5   6
      t   3   3   2   1   2   3   4   5
      t   4   4   3   2   1   2   3   4
      e   5   5   4   3   2   2   3   4
      n   6   6   5   4   3   3   2   3
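A compact Python sketch of the dynamic-programming recurrence that fills this matrix (unit costs, as in the Levenshtein definition above):

    def levenshtein(source, target):
        m, n = len(source), len(target)
        # dp[i][j] = edit distance between source[:i] and target[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i                      # delete all of source[:i]
        for j in range(n + 1):
            dp[0][j] = j                      # insert all of target[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,           # deletion
                               dp[i][j - 1] + 1,           # insertion
                               dp[i - 1][j - 1] + sub)     # substitution or match
        return dp[m][n]

    print(levenshtein("kitten", "sitting"))   # 3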
10. Practical Notes and Tips
• Start with strong preprocessing: consistent tokenization, normalization (lowercasing, Unicode normalization), and dealing with noisy characters.
• Use language models to help disambiguate corrections and to score sequences.
• For morphology-heavy languages, invest in FSTs or morphological analyzers to reduce sparsity.
• Document the tokenization and normalization pipeline used in any experiment; results are sensitive to these choices.
11. Quick References & Next Steps
Suggested next topics to study (appear in later units): n-gram smoothing and backoff, part-
of-speech tagging, syntactic parsing, and semantic analysis. Recommended classic texts:
Jurafsky & Martin (Speech and Language Processing) and Manning & Schütze (Foundations
of Statistical Natural Language Processing).