Morphological Analysis and Finite-State Automata

pixelatedproseai · 48 slides · Oct 15, 2025

About This Presentation

This presentation describes morphological analysis and finite-state automata.


Slide Content

Created by K. Victor Babu · Computational Linguistics (Professional Elective II) · Topic: Morphological Analysis and Finite-State Automata · Dr Dipanwita Debnath, Assistant Professor, Department of CSE (AIML)


Definition: Morphology is the study of the way words are built up from smaller meaning-bearing units, called morphemes. A morpheme is often defined as the minimal meaning-bearing unit in a language.

Classification of Morphemes Morphemes are broadly classified into: Stems – the main morpheme of a word that provides the core meaning. Affixes – morphemes that are added to stems to modify or add to the meaning.

Types of Affixes Affixes are further divided into four types: Prefixes: added before the stem (e.g., un- in unhappy). Suffixes: added after the stem (e.g., -s in cats). Infixes: inserted inside the stem, rare in English (e.g., un-freaking-believable, or un-bloody-believable in British English, from unbelievable). Circumfixes: added both before and after the stem (not common in English but found in other languages).

Multiple Affixes & Agglutinative Languages A word can have more than one affix. Example: rewrites = re- (prefix) + write (stem) + -s (suffix). Example: unbelievably = un- (prefix) + believe (stem) + -able + -ly (suffixes). Languages like Turkish can have words with many affixes (9–10); such languages are called agglutinative languages.

Methods of Word Formation Using Morphemes These processes are vital in speech and language processing : 1. Inflection 2. Derivation 3. Compounding 4. Cliticization

Definition: Inflectional morphology is the branch of morphology that studies how words change form to express grammatical features such as plurality, possession, and tense. In English, the inflectional system is relatively simple: only nouns, verbs, and sometimes adjectives can be inflected, and the number of inflectional affixes is quite small. Inflection adds grammatical meaning, not a new word class. Example: cat → cats (plural), walk → walked (past tense). English nouns typically show two kinds of inflection: plural (cat → cats) and possessive (dog → dog’s).

With inflectional morphology , we can analyze how verbs change form to express grammatical information like tense , aspect , person , and number — without changing the word's core meaning or class. 1. Regular Verbs Verbs that form their past tense and past participle by adding -ed. Examples: walk → walked, jump → jumped 2. Irregular Verbs Verbs that change form completely or in non-standard ways to show tense. Examples: go → went, eat → ate, run → ran 3. Preterite Verbs (Past Tense Verbs) Verbs that show simple past tense. Regular Example: clean → cleaned Irregular Example: sing → sang
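The regular/irregular split above can be sketched as a lookup-then-default rule. The irregular table below is a hypothetical toy sample, not an exhaustive list of English irregular verbs.

```python
# Toy past-tense generator: irregular lookup first, regular -ed otherwise.
# The irregular table is a hypothetical sample, not an exhaustive list.
IRREGULAR_PAST = {"go": "went", "eat": "ate", "run": "ran", "sing": "sang"}

def past_tense(verb):
    """Return the simple past form of an English verb (sketch)."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):        # bake -> baked, not bakeed
        return verb + "d"
    return verb + "ed"
```

For example, past_tense("walk") yields "walked" by the regular rule, while past_tense("sing") hits the irregular table and yields "sang".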

Why It Matters in Computational Linguistics: In computational linguistics, understanding inflectional morphology helps us build systems that can: Parse sentences accurately Identify word roots (stems) Generate or recognize different word forms Perform POS tagging, lemmatization, and machine translation

Derivational Morphology In computational linguistics, derivational morphology refers to how words are formed by adding derivational affixes that change the meaning or part of speech of a base word. Different from inflectional morphology , which doesn’t change the word class.

Adjectives can be derived from nouns and verbs through the addition of derivational suffixes . Relevance in NLP : Tokenization must recognize both computerize and computerization as semantically linked. Search engines and IR systems use stemming to link derived words. In machine translation , systems must understand that "organization" and "organize" are related but different word forms.

Compounding in Morphology Compounding is a type of morphological process where two or more independent words are combined to create a new word with a specific meaning . Made by combining two or more roots (lexical morphemes) The resulting compound acts as a single word May have a new or non-transparent meaning

Cliticization What Is a Clitic? A clitic is a unit that behaves phonologically like an affix (short, unaccented) but syntactically like a word; it lies between a word and an affix in behavior. Clitics often function as pronouns (’s in she’s), auxiliary verbs (’ve, ’re), articles, conjunctions, etc. Types of clitics: Proclitic – attaches before the host word (e.g., ’Tis for it is). Enclitic – attaches after the host word (e.g., she’s for she is).

Finite-State Morphological Parsing Morphological parsing refers to analyzing the structure of words to understand their components (e.g., stems, affixes). The goal is to map input forms (surface word forms) to output forms that include a morphological analysis, often represented as stems + features. Features in Morphological Parsing Features give grammatical and syntactic information about a word's stem. For example: +N = Noun, +Sg = Singular, +Pl = Plural, +Masc = Masculine (used in languages like Spanish).

These features are critical for: Part-of-speech tagging Syntactic parsing Machine translation Information retrieval

Three Core Components for Morphological Parsing 1. Lexicon A repository of stems and affixes . Each entry contains basic information such as: Whether the item is a Noun stem , Verb stem , etc. 2. Morphotactics Defines legal sequences of morphemes within words. Describes which morphemes can follow others (e.g., in English, plural morphemes follow nouns). 3. Orthographic Rules (Spelling Rules) Account for spelling changes that occur when morphemes combine. Example: city + -s → cities (not citys ) due to y → ie rule. Important for mapping surface forms to underlying morpheme sequences.
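The orthographic rules in component 3 can be sketched directly in code. The function below is a minimal illustration of two spelling rules only (the y → ie rule and e-insertion), not a full English pluralizer:

```python
def pluralize(noun):
    """Attach plural -s, applying two common English spelling rules (sketch)."""
    if noun.endswith("y") and noun[-2:-1] not in ("a", "e", "i", "o", "u"):
        return noun[:-1] + "ies"            # y -> ie rule: city -> cities
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                  # e-insertion: fox -> foxes
    return noun + "s"                       # default: cat -> cats
```

Note that "boy" keeps its y (the letter before it is a vowel), whereas "city" becomes "cities", matching the rule described above.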

Building a Finite-State Lexicon Lexicon Definition : In computational linguistics, a lexicon is a structured repository of words used by a language processing system. Simplest Form : The most basic lexicon lists every word explicitly (e.g., “AAA”, “Jane”, “Beijing”, etc.), but this approach is impractical for large-scale systems. Challenge : Listing all words is inconvenient or impossible due to the vast number of forms and constant language evolution.

Efficient Alternative : Computational lexicons are usually built using stems and affixes , rather than full word forms. Morphotactics : These lexicons also include rules (called morphotactics ) that define how morphemes can combine to form valid words. Finite-State Automaton (FSA) : One of the most common models to represent morphotactics is the finite-state automaton , which is efficient and widely used in NLP applications like morphological parsing.

Finite-State Assumption : The FSA (Finite-State Automaton) described in Fig. 3.3 assumes a lexicon containing regular nouns that form plurals by adding “-s” (e.g., cat, dog, fox, aardvark ). Regular Noun Category : These nouns are labeled as reg-noun , representing the majority of English nouns in computational models.

Phonological Adjustment : Some regular nouns (e.g., fox) require insertion of “e” before “s” to form the plural (fox → foxes), demonstrating a basic morphophonemic rule. Irregular Nouns : The lexicon also includes irregular noun forms that do not follow the -s plural rule. Singular irregular nouns are labeled irreg-sg-noun (e.g., goose, mouse); plural irregular nouns are labeled irreg-pl-noun (e.g., geese, mice). Lexicon Categorization : By classifying nouns into categories like reg-noun, irreg-sg-noun, and irreg-pl-noun, computational systems can better handle morphological analysis and generation.
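A minimal recognizer in the spirit of the Fig. 3.3 noun FSA might look like the sketch below. The word lists are tiny hypothetical stand-ins for the lexicon classes, and spelling changes such as fox → foxes are deliberately left to the orthographic rules rather than handled here:

```python
# Toy lexicon classes (hypothetical samples of each category).
REG_NOUN = {"cat", "dog", "aardvark"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}

def accepts_noun(word):
    """FSA-style recognizer: reg-noun plus optional -s, or a listed irregular."""
    if word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
        return True                                   # irregulars listed directly
    if word in REG_NOUN:
        return True                                   # bare singular reg-noun
    return word.endswith("s") and word[:-1] in REG_NOUN  # plural -s path
```

Because irregular plurals are listed rather than derived, the recognizer accepts "geese" but correctly rejects an overregularized form like "mouses".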

Three stem classes: reg-verb-stem : regular verbs (e.g., walk, play); irreg-verb-stem : irregular verbs (e.g., go, eat); irreg-past-verb-form : pre-formed irregular pasts (e.g., went, ate). Four affix classes: -ed (past), -ed (past participle), -ing (present participle), -s (third-person singular present).

States in the FSA q₀ (Start State) : initial state where the verb stem is selected; from here, the system can take one of three paths depending on the stem type. q₁ : represents regular verb stems that take past tense and past participle affixes (both spelled "-ed"). Example: walk → walked (past), walked (past participle). q₂ : handles present participles formed with "-ing" and third-person singular present "-s" forms. Example: walk → walking, walks. q₃ (Final State) : terminal state where the complete inflected verb form is recognized.

Derivational morphology is more complex than inflectional morphology and may require context-free grammars instead of just FSAs. A simpler case is English adjectives, which can include: an optional prefix un- (e.g., unhappy); an obligatory root (happy, cool, big); an optional suffix -er, -est, -ly (e.g., happier, happiest, happily). Example words: clear → clearly, unclear, unclearly; happy → unhappy, happiest, unhappily. This structure can be modeled by a finite-state automaton for basic adjective derivation in NLP tasks.

FSA for English Adjective Morphology – Antworth’s Proposal This Finite-State Automaton (FSA) models how English adjectives are formed by combining prefixes, roots, and suffixes. States and Transitions q₀ : start state; can go directly to q₁ via an ε (empty) transition or by adding the prefix "un-". q₁ : prefix processed; accepts the adjective root (e.g., clear, happy, big). q₂ : root processed; accepts optional derivational suffixes -er, -est, -ly. q₃ : final state, the complete adjective form (e.g., unhappily, clearer, happiest).
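Since this automaton is a regular language, it can be approximated by a regular expression over a toy root list (a hypothetical sample). Note two caveats: spelling changes such as happy → happier are not handled, and the unconstrained pattern also accepts ill-formed words like "unbig", an overgeneration problem:

```python
import re

ROOTS = "clear|happy|cool|big|real"   # toy root list (hypothetical)

# prefix? + root + suffix?, mirroring the q0 -> q1 -> q2 -> q3 paths
ADJ_FSA = re.compile(rf"(un)?({ROOTS})(er|est|ly)?$")

def accepts_adj(word):
    """True if the word matches the optional-prefix/root/optional-suffix FSA."""
    return ADJ_FSA.match(word) is not None
```

For example, "unclearly" and "clearer" are accepted, "happier" is rejected (the y → i spelling change is outside the pattern), and "unbig" is wrongly accepted.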

Finite-State Automata (FSA) and Morphological Analysis FSAs can be used to recognize derivational and inflectional patterns in English, such as identifying valid adjectives or verbs. However, FSAs may also overgenerate, e.g., accept ungrammatical words like unbig, smally, etc., unless constraints are encoded. Need for Lexical Constraints To avoid invalid formations, roots must be classified (e.g., adj-root1, adj-root2) based on which suffixes they can combine with. For instance: adj-root1 (e.g., happy, real) allows affixes like un- and -ly; adj-root2 (e.g., big, small) does not.
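One way to encode that lexical constraint is to give each root class its own affix paths. The root lists below are hypothetical toy samples, and gemination (big → bigger) is still not modeled, so the tests use "smaller":

```python
import re

ADJ_ROOT1 = "clear|happy|real"   # class 1: combines with un- and -ly
ADJ_ROOT2 = "big|small"          # class 2: bare, -er, -est only

CONSTRAINED = re.compile(
    rf"(?:(?:un)?(?:{ADJ_ROOT1})(?:er|est|ly)?"   # class-1 paths
    rf"|(?:{ADJ_ROOT2})(?:er|est)?)$"             # class-2 paths
)

def accepts_adj_constrained(word):
    """Adjective recognizer with per-root-class affix constraints (sketch)."""
    return CONSTRAINED.match(word) is not None
```

Unlike the unconstrained pattern, this version rejects "unbig" and "smally" while still accepting "unclearly" and "smaller".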

What This FSA Represents This FSA models how base words (nouns, verbs, adjectives) can be transformed into new words using derivational morphemes (like -ize, -ation, -ness, -ly). Derivational morphology is used to create new words and change the grammatical category of words (e.g., formal → formality, happy → happiness). Key Components in the FSA States (q0, q1, q2, etc.) : represent stages in the morphological derivation process. Transitions (edges with labels) : show which suffixes can be added at each stage. Labels like -ize/V, -ation/N, -ly/Adv give the suffix (e.g., -ize) followed by the part of speech it produces (V = verb, N = noun, A = adjective, Adv = adverb).

Walkthrough of Some Paths Verb Formation From q0 → q1 via noun₁, then q1 → q2 via -ize/V (now the word is a verb), then q2 → q3 via -ation/N (now it becomes a noun). Example: fossil → fossilize → fossilization. Adjective to Noun q0 → q5 via adj-al (adjectives in -al, like equal, formal), then q5 → q6 via -ness/N or -ity/N (now it becomes a noun). Example: formal → formality, casual → casualness. Adjective to Adverb q5 → q8 → q9 via -ly/Adv. Example: happy → happily. Alternate Adjective Suffixes From q0, adjectives like active, passive, etc., move through q7, q10, q11 using -ive/A, -ative/A, -ful/A.
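The fossil → fossilize → fossilization path can be traced with two toy suffixation rules. These are hypothetical helper functions for illustration only; real derivational morphology has many lexical exceptions these rules do not cover:

```python
def add_ize(noun):
    """Noun -> verb via -ize (e.g., fossil -> fossilize). Toy rule."""
    return noun + "ize"

def add_ation(verb):
    """Verb -> noun via -ation, dropping a final e (fossilize -> fossilization)."""
    return (verb[:-1] if verb.endswith("e") else verb) + "ation"
```

Composing the two rules walks the q0 → q1 → q2 → q3 path: add_ation(add_ize("fossil")) gives "fossilization".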

Applications in Computational Linguistics Morphological Parsing : Systems can use FSAs like this to analyze complex word forms and determine their root and derivation path. Text-to-Speech & Speech Recognition : Helps in pronunciation prediction based on morphemes. Information Retrieval & NLP : By reducing words to their base forms , FSAs help in lemmatization , improving search and indexing. Machine Translation : Understanding derivational structure helps in translating morphologically rich languages.

Morphological Analysis Morphological analysis involves studying the structure and formation of words, which is crucial for understanding and processing language effectively. Morphology is the branch of linguistics concerned with the structure and form of words in a language. Morphological analysis, in the context of NLP, refers to the computational processing of word structures. It aims to break down words into their constituent parts, such as roots, prefixes, and suffixes, and to understand their roles and meanings. This process is essential for various NLP tasks, including language modeling, text analysis, and machine translation.

Importance of Morphological Analysis Morphological analysis is a critical step in NLP for several reasons: Understanding Word Formation : It helps in identifying the basic building blocks of words, which is crucial for language comprehension. Improving Text Analysis : By breaking down words into their roots and affixes, it enhances the accuracy of text analysis tasks like sentiment analysis and topic modeling. Enhancing Language Models : Morphological analysis provides detailed insights into word formation, improving the performance of language models used in tasks like speech recognition and text generation. Facilitating Multilingual Processing : It aids in handling the morphological diversity of different languages, making NLP systems more robust and versatile.

Applications of Morphological Analysis Morphological analysis has numerous applications in NLP, contributing to the advancement of various technologies and systems: Information Retrieval : Enhances search engines by improving the matching of query terms with relevant documents, even if they are in different morphological forms. Machine Translation : Facilitates accurate translation by understanding and generating correct word forms in different languages. Text-to-Speech Systems : Improves pronunciation and intonation by accurately identifying word structures and their stress patterns. Spell Checkers and Grammar Checkers : Detects and suggests corrections for misspelled words and grammatical errors by analyzing word forms and their usage. Named Entity Recognition (NER) : Helps in identifying and classifying named entities in text by understanding their morphological variations.

Key Techniques used in Morphological Analysis for NLP Tasks 1. Stemming 2. Lemmatization 3. Morphological Parsing 4. Neural Network Models 5. Rule-Based Methods 6. Hidden Markov Models (HMMs)

Key Techniques used in Morphological Analysis for NLP Tasks 1. Stemming Stemming reduces words to their base or root form, usually by removing suffixes. The resulting stems are not necessarily valid words but are useful for text normalization. Common ways to implement stemming in Python: Porter Stemmer : one of the most popular stemming algorithms, known for its simplicity and efficiency. Snowball Stemmer : an improvement over the Porter Stemmer, supporting multiple languages. Lancaster Stemmer : a more aggressive stemming algorithm, often producing shorter stems. 2. Lemmatization Lemmatization reduces words to their base or dictionary form (lemma). It considers the context and part of speech, producing valid words. To implement lemmatization in Python, the WordNet Lemmatizer is commonly used; it leverages the WordNet lexical database to find the base form of words.
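To illustrate the idea behind stemming without depending on NLTK, a naive suffix-stripper might look like the sketch below. This is not the Porter algorithm (which applies ordered rule phases with measure conditions); the suffix list is a hypothetical sample:

```python
# Longest suffixes first, so "ization" is tried before "s".
SUFFIXES = ("ization", "ational", "ing", "ed", "ly", "es", "s")

def simple_stem(word):
    """Strip the first matching suffix, keeping at least 3 stem letters (sketch)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

Like real stemmers, this maps "jumping" and "walked" to "jump" and "walk"; unlike a lemmatizer, it has no notion of part of speech or dictionary validity.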

Morphological Parsing Morphological parsing involves analyzing the structure of words to identify their morphemes (roots, prefixes, suffixes). It requires knowledge of morphological rules and patterns. Finite-State Transducers (FSTs) are used as a tool for morphological parsing. Finite-State Transducers (FSTs) FSTs are computational models used to represent and analyze the morphological structure of words. They consist of states and transitions, capturing the rules of word formation. Applications : Morphological Analysis : parsing words into their morphemes. Morphological Generation : generating word forms from morphemes.
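An FST maps surface forms to lexical forms such as cat +N +Pl. The toy lookup below imitates that mapping with hypothetical dictionary entries; it is a sketch of the input/output behavior, not a real transducer implementation:

```python
# Hypothetical toy lexicon: stem -> lexical form prefix.
LEXICON = {"cat": "cat+N", "fox": "fox+N", "dog": "dog+N"}
# Irregular surface forms are mapped directly.
IRREGULAR = {"geese": "goose+N+Pl", "mice": "mouse+N+Pl"}

def morph_parse(word):
    """Surface form -> stem plus features, or None if unrecognized (sketch)."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word in LEXICON:
        return LEXICON[word] + "+Sg"
    if word.endswith("es") and word[:-2] in LEXICON:   # foxes -> fox
        return LEXICON[word[:-2]] + "+Pl"
    if word.endswith("s") and word[:-1] in LEXICON:    # cats -> cat
        return LEXICON[word[:-1]] + "+Pl"
    return None
```

Running the parse direction gives morph_parse("cats") = "cat+N+Pl" and morph_parse("geese") = "goose+N+Pl"; a true FST would also run in reverse for morphological generation.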

Neural Network Models Neural network models, especially deep learning models, can be trained to perform morphological analysis by learning patterns from large datasets. Types of Neural Networks: Recurrent Neural Networks (RNNs) : useful for sequential data like text. Convolutional Neural Networks (CNNs) : can capture local patterns in text. Transformers : advanced models like BERT and GPT that understand context and semantics.

Rule-Based Methods Rule-based methods rely on manually defined linguistic rules for morphological analysis. These rules can handle specific language patterns and exceptions. Applications : Affix Stripping : Removing known prefixes and suffixes to find the root form. Inflectional Analysis : Identifying grammatical variations like tense, number, and case.

Hidden Markov Models (HMMs) Hidden Markov Models (HMMs)  are probabilistic models that can be used to analyze sequences of data, such as morphemes in words. HMMs consist of a set of hidden states, each representing a possible state of the system, and observable outputs generated from these states. In the context of morphological analysis, HMMs can be used to model the probabilistic relationships between sequences of morphemes, helping to predict the most likely sequence of morphemes for a given word. Components of Hidden Markov Models (HMMs): States : Represent different parts of words (e.g., prefixes, roots, suffixes). Observations : The actual characters or morphemes in the words. Transition Probabilities : Probabilities of moving from one state to another. Emission Probabilities : Probabilities of an observable output being generated from a state. Applications : Morphological Segmentation : Breaking words into morphemes. Part-of-Speech Tagging : Assigning parts of speech to each word in a sentence. Sequence Prediction : Predicting the most likely sequence of morphemes for a given word.
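The components listed above (states, observations, transition and emission probabilities) can be exercised on a toy HMM that labels pre-segmented morphemes as prefix, root, or suffix via the Viterbi algorithm. All probabilities and vocabularies below are made-up illustrative numbers, not trained values:

```python
# Toy HMM for morpheme labeling (all numbers hypothetical).
STATES = ("PRE", "ROOT", "SUF")
START = {"PRE": 0.3, "ROOT": 0.7, "SUF": 0.0}
TRANS = {"PRE":  {"PRE": 0.1, "ROOT": 0.9, "SUF": 0.0},
         "ROOT": {"PRE": 0.0, "ROOT": 0.1, "SUF": 0.9},
         "SUF":  {"PRE": 0.0, "ROOT": 0.0, "SUF": 1.0}}
EMIT = {"PRE":  {"un": 0.5, "re": 0.5},
        "ROOT": {"believe": 0.5, "write": 0.5},
        "SUF":  {"able": 0.4, "ly": 0.3, "s": 0.3}}

def viterbi(morphemes):
    """Most probable PRE/ROOT/SUF labeling for a morpheme sequence."""
    # Initialize with start * emission probabilities.
    probs = [{s: START[s] * EMIT[s].get(morphemes[0], 0.0) for s in STATES}]
    back = []
    for m in morphemes[1:]:
        col, ptr = {}, {}
        for s in STATES:
            # Best predecessor for state s, then extend with emission.
            prev = max(STATES, key=lambda p: probs[-1][p] * TRANS[p][s])
            col[s] = probs[-1][prev] * TRANS[prev][s] * EMIT[s].get(m, 0.0)
            ptr[s] = prev
        probs.append(col)
        back.append(ptr)
    # Follow backpointers from the best final state.
    best = max(STATES, key=lambda s: probs[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

On the segmented word un + believe + able, the decoder recovers the labeling PRE, ROOT, SUF, which is the sequence-prediction application named above.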