Natural Language Processing: UNIT-II Slides



Audisankara College of Engineering and Technology (A), Gudur, AP. Department of CSE. Subject: Natural Language Processing (20DS602), Sixth Semester. VENKATA RATHNAM, Associate Professor, Department of CSE, ASCET. Topic: SYNTACTIC ANALYSIS

UNIT-II SYLLABUS: English Word Classes, The Penn Treebank Part-of-Speech Tagset, Part-of-Speech Tagging, HMM Part-of-Speech Tagging, Maximum Entropy Markov Models, Grammar Rules for English, Treebanks, Grammar Equivalence and Normal Form, Lexicalized Grammar.

ENGLISH WORD CLASSES:
There are four main word classes: nouns, verbs, adjectives, and adverbs.

Nouns: Nouns are the words we use to describe people, places, objects, feelings, concepts, etc. Usually, nouns are tangible (touchable) things, such as a table, a person, or a building. Example: "My sister went to school."

Verbs: Verbs are words that show an action, event, feeling, or state of being. This can be a physical action or event, or it can be a feeling that is experienced. Example: "She wished for a sunny day."

Adjectives: Adjectives are words used to modify nouns, usually by describing them. Adjectives describe an attribute, quality, or state of being of the noun. Example: "The friendly woman wore a beautiful dress."

Adverbs: Adverbs are words that work alongside verbs, adjectives, and other adverbs. They provide further description of how, where, when, and how often something is done. Example: "The music was too loud."

The other five word classes are prepositions, pronouns, determiners, conjunctions, and interjections. These are considered function words; they provide structural and relational information in a sentence or phrase.

Prepositions: Prepositions are used to show the relationship between words in terms of place, time, direction, and agency. Example: "They went through the tunnel."

Pronouns: Pronouns take the place of a noun or a noun phrase in a sentence. They often refer to a noun that has already been mentioned and are commonly used to avoid repetition. Chloe (noun) → she (pronoun); Chloe's dog → her dog (possessive pronoun). Example: "She sat on the chair, which was broken."

Determiners: Determiners work alongside nouns to clarify information about the quantity, location, or ownership of the noun; a determiner 'determines' exactly what is being referred to. Much like pronouns, there are several different types of determiners. Example: "The first restaurant is better than the other."

Conjunctions: Conjunctions are words that connect other words, phrases, and clauses within a sentence. There are three main types:
- Coordinating conjunctions link independent clauses together: for, and, nor, but, or, yet, so.
- Subordinating conjunctions link dependent clauses to independent clauses: after, as, because, when, while, before, if, even though.
- Correlative conjunctions work in pairs to join two parts of a sentence of equal importance: either/or, neither/nor, both/and.
Example: "If it rains, I'm not going out."

Interjections: Interjections are exclamatory words used to express an emotion or a reaction. They often stand alone from the rest of the sentence and are accompanied by an exclamation mark. Example: "Oh, what a surprise!"

PART-OF-SPEECH TAGGING:
Given the input "this is a simple sentence", the goal is to identify the part of speech (syntactic category) of each word:

this/DET is/VERB a/DET simple/ADJ sentence/NOUN

The set of part-of-speech (POS) categories can differ based on the application, the corpus annotators, and the language. One universal tagset used by Google (Petrov et al., 2011):

Tag    Description                                      Example
VERB   Verbs (all tenses and modes)                     eat, ate, eats
NOUN   Nouns (common and proper)                        home, Micah
PRON   Pronouns                                         I, you, your, he
ADJ    Adjectives                                       yellow, bigger, wildest
ADV    Adverbs                                          quickly, faster, fastest
ADP    Adpositions (prepositions and postpositions)     of, in, by, under
CONJ   Conjunctions                                     and, or, but
DET    Determiners                                      a, an, the, this
NUM    Cardinal numbers                                 one, two, first, second
PRT    Particles, other function words                  up, down, on, off
X      Other: foreign words, typos, abbreviations       brasserie, abcense, HMM
.      Punctuation                                      ?, !, .
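A quick way to experiment with this tagset is NLTK, which can map its Penn Treebank tags onto the universal tagset. A minimal sketch, assuming the relevant NLTK data packages have been downloaded:

```python
import nltk

# One-time downloads: tokenizer, tagger, and the universal-tagset mapping.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("universal_tagset")

tokens = nltk.word_tokenize("this is a simple sentence")

# tagset="universal" maps Penn Treebank tags onto the 12-tag universal set.
print(nltk.pos_tag(tokens, tagset="universal"))
# Expected output (roughly):
# [('this', 'DET'), ('is', 'VERB'), ('a', 'DET'),
#  ('simple', 'ADJ'), ('sentence', 'NOUN')]
```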

POS tagging is hard:
- Ambiguity: glass of water/NOUN vs. water/VERB the plants; lie/VERB down vs. tell a lie/NOUN; wind/VERB down vs. a mighty wind/NOUN.
- Sparsity: words we never see; word-tag pairs we never see.

A probabilistic model for tagging
Let x_t denote the word and z_t denote the tag at time step t.
- Initialization: z_0 = <s>
- Repeat:
  - Choose a tag based on the previous tag: P(z_t | z_{t-1})
  - If z_t = </s>: break
  - Choose a word conditioned on its tag: P(x_t | z_t)
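This generative story can be written directly as a sampler. A minimal sketch with made-up transition and emission tables (all probabilities below are illustrative, not estimated from a corpus):

```python
import random

# Illustrative transition probabilities P(z_t | z_{t-1}); "<s>"/"</s>" are
# the special start/end tags from the model above.
transitions = {
    "<s>":  {"DET": 0.8, "NOUN": 0.2},
    "DET":  {"NOUN": 0.6, "ADJ": 0.4},
    "ADJ":  {"NOUN": 1.0},
    "NOUN": {"VERB": 0.5, "</s>": 0.5},
    "VERB": {"DET": 0.5, "</s>": 0.5},
}

# Illustrative emission probabilities P(x_t | z_t).
emissions = {
    "DET":  {"the": 0.7, "a": 0.3},
    "ADJ":  {"simple": 0.5, "sunny": 0.5},
    "NOUN": {"sentence": 0.5, "day": 0.5},
    "VERB": {"is": 0.6, "wished": 0.4},
}

def sample(dist):
    """Draw one key from a {outcome: probability} dict."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs)[0]

def generate():
    z, pairs = "<s>", []
    while True:
        z = sample(transitions[z])              # P(z_t | z_{t-1})
        if z == "</s>":
            break
        pairs.append((sample(emissions[z]), z))  # P(x_t | z_t)
    return pairs

print(generate())  # e.g. [('the', 'DET'), ('sentence', 'NOUN'), ('is', 'VERB'), ...]
```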

This model can be represented with a state diagram whose nodes are the tags and whose arcs carry the transition probabilities.

THE PENN TREEBANK PART-OF-SPEECH TAGSET:
The English Penn Treebank tagset is used with English corpora annotated by the TreeTagger tool. The table shows the English Penn Treebank tagset with Sketch Engine modifications.

No.  Tag    Description
1.   CC     Coordinating conjunction
2.   CD     Cardinal number
3.   DT     Determiner
4.   EX     Existential there
5.   FW     Foreign word
6.   IN     Preposition or subordinating conjunction
7.   JJ     Adjective
8.   JJR    Adjective, comparative
9.   JJS    Adjective, superlative
10.  LS     List item marker
11.  MD     Modal
12.  NN     Noun, singular or mass
13.  NNS    Noun, plural
14.  NNP    Proper noun, singular
15.  NNPS   Proper noun, plural
16.  PDT    Predeterminer
17.  POS    Possessive ending
18.  PRP    Personal pronoun
19.  PRP$   Possessive pronoun
20.  RB     Adverb
21.  RBR    Adverb, comparative
22.  RBS    Adverb, superlative
23.  RP     Particle
24.  SYM    Symbol
25.  TO     to
26.  UH     Interjection
27.  VB     Verb, base form
28.  VBD    Verb, past tense
29.  VBG    Verb, gerund or present participle
30.  VBN    Verb, past participle
31.  VBP    Verb, non-3rd person singular present
32.  VBZ    Verb, 3rd person singular present
33.  WDT    Wh-determiner
34.  WP     Wh-pronoun
35.  WP$    Possessive wh-pronoun
36.  WRB    Wh-adverb

PART-OF-SPEECH TAGGING APPROACHES:
POS tagging is the process of converting a sentence into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag and signifies whether the word is a noun, adjective, verb, and so on. Most POS tagging approaches fall into three families:
- Rule-based taggers use a dictionary or lexicon of patterns to get the possible tags for each word, as in the sketch below.
- Stochastic taggers disambiguate words based on the probability that a word occurs with a particular tag.
- Transformation-based tagging, also called Brill tagging, is an instance of transformation-based learning (TBL).
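For a taste of the rule-based style, NLTK's RegexpTagger assigns tags from hand-written suffix patterns. A minimal sketch; the patterns here are illustrative, not a complete rule set:

```python
from nltk.tag import RegexpTagger

# Hand-written rules, tried in order; the last pattern is a default.
patterns = [
    (r".*ing$", "VBG"),        # gerunds: running, eating
    (r".*ed$", "VBD"),         # simple past: walked, wished
    (r".*es$", "VBZ"),         # 3rd person singular present: wishes
    (r"^(a|an|the)$", "DT"),   # articles
    (r".*", "NN"),             # default: tag everything else as a noun
]

tagger = RegexpTagger(patterns)
print(tagger.tag(["the", "dog", "wished", "running"]))
# [('the', 'DT'), ('dog', 'NN'), ('wished', 'VBD'), ('running', 'VBG')]
```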

HMM PART-OF-SPEECH TAGGING:
Before digging deep into HMM POS tagging, we must understand the concept of the Hidden Markov Model (HMM).

Hidden Markov Model
An HMM may be defined as a doubly embedded stochastic model in which the underlying stochastic process is hidden. This hidden stochastic process can only be observed through another set of stochastic processes that produces the sequence of observations.

Example
Suppose a sequence of hidden coin-tossing experiments is carried out and we see only the observation sequence of heads and tails. The actual details of the process, such as how many coins were used and the order in which they were selected, are hidden from us. By observing this sequence of heads and tails, we can build several HMMs to explain the sequence. One form of Hidden Markov Model for this problem is described below.

We assume there are two states in the HMM, each corresponding to the selection of a different biased coin. The following matrix gives the state transition probabilities:

A = [ a11  a12 ]
    [ a21  a22 ]

Here, a_ij = the probability of transition from state i to state j, with a11 + a12 = 1 and a21 + a22 = 1.
P1 = probability of heads of the first coin, i.e. the bias of the first coin.
P2 = probability of heads of the second coin, i.e. the bias of the second coin.
We can also create an HMM model assuming that there are 3 coins or more. In this way, an HMM is characterized by the following elements:
- N, the number of states in the model (in the above example N = 2: only two states).
- M, the number of distinct observations that can appear with each state (in the above example M = 2, i.e. H or T).
- A, the state transition probability distribution (the matrix A in the above example).
- P, the probability distribution of the observable symbols in each state (in our example, P1 and P2).
- I, the initial state distribution.
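The two-coin model is small enough to write out in full. A minimal sketch, with the biases P1, P2 and the transition matrix chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for the two-coin HMM (not estimated from data).
A  = np.array([[0.7, 0.3],    # a11, a12: transitions out of state 1
               [0.4, 0.6]])   # a21, a22: transitions out of state 2
P  = np.array([0.9, 0.2])     # P1, P2: probability of heads for each coin
pi = np.array([0.5, 0.5])     # I: the initial state distribution

def sample(T):
    """Generate T hidden coin choices and their observed H/T outcomes."""
    states, obs = [], []
    z = rng.choice(2, p=pi)
    for _ in range(T):
        obs.append("H" if rng.random() < P[z] else "T")
        states.append(z)
        z = rng.choice(2, p=A[z])
    return states, obs

states, obs = sample(10)
print("hidden coins:", states)         # hidden from the observer
print("observations:", "".join(obs))   # all we actually get to see
```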

Hidden Markov Model
A hidden Markov model (HMM) defines a probability distribution over a sequence of states and output observations:
- Output sequence: x_{1:T} = x_1, x_2, ..., x_T. We denote these as vectors, but they can also be scalars or discrete observations.
- State sequence: z_{0:T+1} = z_0, z_1, ..., z_T, z_{T+1}. Each z_t takes on an integer value z_t ∈ {0, ..., K+1} representing the state at time t.

An HMM is specified by:
- A set of states: {0, 1, ..., K+1}.
- Transition probabilities: a matrix A with A_{i,j} = P_A(z_t = j | z_{t-1} = i).
- An emission distribution for each state: p(x_t | z_t). We denote the emission distribution as continuous, but it can also be discrete.
- Grouping the parameters together: θ = {A, φ}, where φ denotes the emission parameters.

The start and end states are special:
- We always start in z_0 = 0. Transitioning out of this start state is captured by A_{0,j}.
- We always end in z_{T+1} = K+1. Transitioning into this final state is captured by A_{j,K+1}.
- States 0 and K+1 are non-emitting: they have no corresponding x when we move in or out of them.

The three HMM problems
Problem 1: The marginal probability. Given an observed sequence x_{1:T} and a trained HMM with parameters θ, what is the probability of the observed sequence, p(x_{1:T})?
Problem 2: The most likely state sequence. Given an observed sequence x_{1:T} and a trained HMM with parameters θ, what is the most likely state sequence through the HMM? arg max_{z_{0:T+1}} P(z_{0:T+1} | x_{1:T}). (See the Viterbi sketch below.)
Problem 3: Learning. Given training data x_{1:T}, how do we choose the HMM parameters θ to maximize p(x_{1:T})?
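Problem 2 is solved by the Viterbi algorithm, a dynamic program over states and time steps. A minimal sketch for a discrete-emission HMM, reusing the two-coin parameters from the sketch above (start/end states are folded into pi here for brevity):

```python
import numpy as np

def viterbi(obs, A, B, pi):
    """Most likely state sequence for a discrete-emission HMM.

    obs: list of observation indices; A[i, j] = P(z_t = j | z_{t-1} = i);
    B[i, o] = P(x_t = o | z_t = i); pi[i] = P(z_1 = i).
    Log probabilities avoid underflow on long sequences.
    """
    K, T = A.shape[0], len(obs)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, K))             # best log prob of a path ending in state k at t
    back = np.zeros((T, K), dtype=int)   # backpointers for path recovery

    delta[0] = np.log(pi) + logB[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # K x K: previous state -> state
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + logB[:, obs[t]]

    # Trace back from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two-coin example: state 0 is heads-biased, state 1 is tails-biased.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # columns: P(H), P(T) per state
pi = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 1, 0]                  # H H T T T H
print(viterbi(obs, A, B, pi))             # e.g. [0, 0, 1, 1, 1, 0]
```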

MAXIMUM ENTROPY MARKOV MODELS:
The Maximum Entropy Markov Model (MEMM) has explicit dependencies between each state and the full observation sequence, which makes it more expressive than an HMM. The HMM uses two probability matrices (state transition and emission probabilities): we need to predict a tag given an observation, but an HMM models the probability of a tag producing a certain observation, which follows from its generative approach. Instead of the transition and observation matrices of the HMM, the MEMM has a single transition probability model, which maps each combination of previous state y_{i-1} and current observation x_i seen in the training data to the current state y_i.
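The per-step distribution P(y_i | y_{i-1}, x_i) can be realized as a maximum-entropy (logistic regression) classifier over features of the previous tag and the current observation. A minimal sketch using scikit-learn, with a tiny invented training set and feature function purely for illustration:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative tagged corpus (invented for this sketch).
sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

def features(prev_tag, word):
    # An MEMM can condition on arbitrary overlapping features of
    # (previous tag, observation); here just three simple ones.
    return {"prev_tag": prev_tag, "word": word, "suffix2": word[-2:]}

X, y = [], []
for sent in sentences:
    prev = "<s>"
    for word, tag in sent:
        X.append(features(prev, word))
        y.append(tag)
        prev = tag

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

# Local distribution P(y_i | y_{i-1} = "DET", x_i = "dog"):
probs = clf.predict_proba(vec.transform([features("DET", "dog")]))[0]
print(dict(zip(clf.classes_, probs.round(3))))
```

In a full tagger, these local distributions would be chained over the sentence and decoded with Viterbi rather than greedily.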

TREEBANKS:
A treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags.
Types of treebank corpus:
- Semantic treebanks: these treebanks use a formal representation of each sentence's semantic structure, such as a predicate-logic-based meaning representation.
- Syntactic treebanks: in contrast to semantic treebanks, these annotate the syntactic structure of sentences, typically as parse trees; the Penn Treebank is the best-known example.
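NLTK ships a 10% sample of the Penn Treebank, which is a convenient way to look at real treebank annotations (assumes the corpus has been downloaded):

```python
import nltk
from nltk.corpus import treebank

nltk.download("treebank")  # fetch the 10% Penn Treebank sample

tree = treebank.parsed_sents()[0]   # first parsed sentence of the sample
print(tree)                          # bracketed parse, e.g. (S (NP-SBJ (NNP Pierre) ...) ...)
print(tree.leaves()[:5])             # the first few words of that sentence
```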

GRAMMAR RULES FOR ENGLISH:
1. Adjectives and adverbs
2. Pay attention to homophones
3. Use the correct conjugation of the verb
4. Connect your ideas with conjunctions
5. Sentence construction
6. Remember the word order for questions
7. Use the right past form of verbs
8. Get familiar with the main English verb tenses
9. Never use a double negative

GRAMMAR EQUIVALENCE AND NORMAL FORM:
There are many ways to transform grammars so that they are more useful for a particular purpose. The basic idea:
1. Apply transformation 1 to G to get rid of undesirable property 1. Show that the language generated by G is unchanged.
2. Apply transformation 2 to G to get rid of undesirable property 2. Show that the language generated by G is unchanged AND that undesirable property 1 has not been reintroduced.
3. Continue until the grammar is in the desired form.
Normal forms: if you want to design algorithms, it is often useful to have a limited number of input forms that you have to deal with. Normal forms are designed to do just that, and various ones have been developed for various purposes; a standard example is Chomsky Normal Form, in which every rule produces either two nonterminals or one terminal (see the sketch below).
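As a concrete illustration, NLTK can binarize a parse tree into Chomsky Normal Form, so that every node has at most two children. A minimal sketch:

```python
from nltk import Tree

# A small parse with a ternary VP node (three children).
t = Tree.fromstring(
    "(S (NP (DT the) (NN dog))"
    "   (VP (VBD gave) (NP (NN Mary)) (NP (DT a) (NN bone))))"
)

t.chomsky_normal_form()  # binarize in place, introducing VP|<...> helper nodes
t.pretty_print()
# Every internal node now has at most two children, so the tree corresponds
# to productions in Chomsky Normal Form.
```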

LEXICALIZED GRAMMAR:
A lexical grammar is a formal grammar defining the syntax of tokens: a program is written using characters that are defined by the lexical structure of the language used, and the character set is equivalent to the alphabet of a written language. In parsing, we say that a grammar is lexicalized if it consists of:
1. A finite set of structures, each associated with a lexical item.
2. An operation or operations for combining the structures.
Each lexical item is called the anchor of the corresponding structure, over which it specifies linguistic constraints. Hence, the constraints are local to the anchored structure.
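The definition can be made concrete with a toy inventory of anchored structures, in the spirit of a lexicalized tree-adjoining grammar. Everything below is an illustrative sketch (invented structures and a string-substitution "combine" operation), not a real TAG implementation:

```python
# Each elementary structure is anchored by one lexical item; the anchor's
# constraints (e.g. how many NP arguments it takes) are stated locally.
lexicon = {
    "sleeps":  {"structure": "(S NP↓ (VP (V sleeps)))",        "requires": 1},
    "devours": {"structure": "(S NP↓ (VP (V devours) NP↓))",   "requires": 2},
    "dog":     {"structure": "(NP (DT the) (N dog))",          "requires": 0},
    "bone":    {"structure": "(NP (DT a) (N bone))",           "requires": 0},
}

def combine(anchor, *args):
    """Substitute argument structures into the anchor's NP↓ slots, left to right."""
    entry = lexicon[anchor]
    assert len(args) == entry["requires"], "arity is a constraint local to the anchor"
    tree = entry["structure"]
    for arg in args:
        tree = tree.replace("NP↓", lexicon[arg]["structure"], 1)
    return tree

print(combine("sleeps", "dog"))
# (S (NP (DT the) (N dog)) (VP (V sleeps)))
print(combine("devours", "dog", "bone"))
```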
