Text Analysis Operations using NLTK.pptx


Slide Content

Text Analysis Operations using NLTK
Prepared by: pk pathare

NLTK?
NLTK (Natural Language Toolkit) is a powerful Python package that provides a diverse set of natural language processing algorithms. It is free and open source, easy to use, well documented, and backed by a large community. NLTK covers the most common tasks such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.
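
The slides that follow assume NLTK is installed and a few standard data packages are downloaded. A minimal setup sketch; the resource names below are the usual NLTK downloads for these examples, though exact names can differ across NLTK versions:

# pip install nltk
import nltk

# download the corpora/models used in the following slides
nltk.download('punkt')                       # sentence/word tokenizers
nltk.download('stopwords')                   # stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the NE chunker
nltk.download('wordnet')                     # WordNet corpus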

Tokenization
Tokenization is the first step in text analytics. The process of breaking down a text paragraph into smaller chunks such as words or sentences is called tokenization. A token is a single entity that serves as a building block of a sentence or paragraph.

Sentence Tokenization
A sentence tokenizer breaks a text paragraph into sentences.

import nltk
from nltk.tokenize import sent_tokenize

text ="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard ""“ tokenized_text = sent_tokenize (text) print ( tokenized_text )

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]

Word Tokenization
A word tokenizer breaks a text paragraph into words.

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

Frequency Distribution

from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 25 samples and 30 outcomes>

fdist.most_common(2)

[('is', 3), (',', 2)]
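
The frequency distribution can also be visualized. A minimal sketch, assuming matplotlib is installed (FreqDist.plot draws the most frequent tokens):

import matplotlib.pyplot as plt
fdist.plot(10, cumulative=False)  # line plot of the 10 most frequent tokens
plt.show()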

Stopwords
Stopwords are considered noise in the text. Text may contain stop words such as is, am, are, this, a, an, the, etc. To remove stopwords in NLTK, you need to create a set of stopwords and filter your list of tokens against it.

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)

{'before', 'then', 'ain', 'weren', 'that', "mustn't", 'these', "don't", "you'd", 'about', 'between', 'them', 'under', 'at', 'to', "you're", 'once', 'here', 'during', 'him', "you'll", 'yourselves', 'we', 's', 'mustn', "it's", 'didn', 'up', 'myself', "she's", 'being', 'am', 'o', 't', 'you', 'where', 'it', 'too', 'while', 'when', 'very', 'hadn', 've', "mightn't", 'further', 'wouldn', 'is', 'are', 'yourself', 'doing', 'had', 'but', 'can', "wouldn't", 'doesn', "doesn't", 'ourselves', 'hasn', 'from', 'through', 'for', 'of', 'whom', 'such', 'shouldn', 'with', 'an', 'and', "didn't", 'those', 'again', "needn't", "aren't", "weren't", 'herself', 'be', 'll', 'just', 'own', "shan't", "that'll", 'do', 'over', 'don', "couldn't", 'more', 'other', 'not', 'if', 'most', 'why', 'so', "won't", 'how', 'were', 'its', 'itself', 'a', 'was', "wasn't", 'themselves', 'haven', 'he', 'out', 'my', 'into', 'ours', 'both', 'couldn', "shouldn't", 'which', 'been', 'there', 'same', 'few', 'off', 'y', 'has', 'isn', 'no', 'd', 're', 'needn', 'as', 'by', 'shan', "you've", 'have', 'some', 'having', 'm', 'his', 'they', "isn't", 'himself', 'i', 'each', 'hers', 'nor', 'only', "haven't", 'what', 'now', 'down', 'above', 'she', 'should', "should've", 'wasn', 'on', 'the', 'me', 'after', 'their', 'did', 'or', 'theirs', 'won', 'our', 'mightn', 'who', "hadn't", 'in', 'than', 'yours', 'aren', 'until', 'does', 'against', 'all', 'this', 'any', "hasn't", 'her', 'your', 'below', 'will', 'ma', 'because'}

Removing Stopwords

filtered_sent = []
for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokenized_word)
print("Filtered Sentence:", filtered_sent)

Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']

Lexicon Normalization
Lexicon normalization addresses another type of noise in text. For example, connection, connected, and connecting all reduce to a common word, "connect". It reduces derivationally related forms of a word to a common root word.

Stemming
Stemming is a process of linguistic normalization which reduces words to their root word, or chops off derivational affixes. For example, connection, connected, and connecting reduce to a common word, "connect".

Consider the words: learn, learning, learned, learnt. All these words are stemmed to the common root learn. However, in some cases the stemming process produces stems that are not correct spellings of the root word, for example happi and sunni. That is because it chooses the most common stem for related words. For example, look at the set of words that comprises the different forms of happy: happy, happiness, happier. We can see that the stem happi is more commonly used. We cannot choose happ because it is the stem of unrelated words like happen.
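
A quick way to check some of these stems with NLTK's PorterStemmer; a minimal sketch using a few of the example words from this slide:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["connection", "connected", "connecting"]:
    print(w, "->", ps.stem(w))   # each reduces to 'connect'
for w in ["happy", "happiness"]:
    print(w, "->", ps.stem(w))   # each reduces to 'happi'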

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?', 'the', 'weather', 'great', ',', 'citi', 'awesom', '.', 'the', 'sky', 'pinkish-blu', '.', 'you', "n't", 'eat', 'cardboard']

Lemmatization
Lemmatization reduces words to their base word, which is a linguistically correct lemma. It transforms a word to its root form using vocabulary and morphological analysis. Lemmatization is usually more sophisticated than stemming: a stemmer works on an individual word without knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming misses this, because it requires a dictionary look-up.

#Lexicon Normalization
#performing stemming and lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, "v"))  # "v" = treat the word as a verb
print("Stemmed Word:", stem.stem(word))

Output:
Lemmatized Word: fly
Stemmed Word: fli
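
The "better"/"good" example from the previous slide can be checked the same way. A minimal sketch reusing lem and stem from above; the lemmatizer needs the part of speech "a" (adjective), otherwise it returns the word unchanged:

print("Lemmatized Word:", lem.lemmatize("better", pos="a"))  # good
print("Stemmed Word:", stem.stem("better"))                  # better (the stemmer cannot recover "good")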

POS Tagging
The primary target of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word: whether it is a noun, pronoun, adjective, verb, adverb, etc., based on the context. POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word.

import numpy as np
import pandas as pd
import os
from nltk import word_tokenize, pos_tag

# os.listdir('../input/nlp-getting-started/train.csv')
# read data from nlp-getting-started
nlp_start_df = pd.read_csv('../input/nlp-getting-started/train.csv')

# take one example sentence
ex = nlp_start_df.loc[159]['text']

# tokenize the sentence and apply POS tagging
sent = pos_tag(word_tokenize(ex))
sent

[('Experts', 'NNS'), ('in', 'IN'), ('France', 'NNP'), ('begin', 'VB'), ('examining', 'VBG'), ('airplane', 'JJ'), ('debris', 'NN'), ('found', 'VBD'), ('on', 'IN'), ('Reunion', 'NNP'), ('Island', 'NNP'), (':', ':'), ('French', 'JJ'), ('air', 'NN'), ('accident', 'NN'), ('experts', 'NNS'), ('on', 'IN'), ('Wedn', 'NNP'), ('...', ':'), ('http', 'NN'), (':', ':'), ('//t.co/v4SMAESLK5', 'NN')]
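
The Kaggle train.csv above is only needed to get an example tweet. If that file is not available, pos_tag can be applied to any tokenized sentence, for example the text from the earlier slides; a sketch (the tags will of course differ from the tweet output above):

from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("Hello Mr. Smith, how are you doing today?")))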

CC   coordinating conjunction
CD   cardinal digit
DT   determiner
EX   existential there (like "there is" ... think of it like "there exists")
FW   foreign word
IN   preposition/subordinating conjunction
JJ   adjective ('big')
JJR  adjective, comparative ('bigger')
JJS  adjective, superlative ('biggest')
LS   list marker (1))
MD   modal (could, will)
NN   noun, singular ('desk')
NNS  noun, plural ('desks')
NNP  proper noun, singular ('Harrison')
NNPS proper noun, plural ('Americans')
PDT  predeterminer ('all the kids')
POS  possessive ending (parent's)
PRP  personal pronoun (I, he, she)
RB   adverb (very, silently)
RBR  adverb, comparative (better)
RBS  adverb, superlative (best)
RP   particle (give up)
TO   to (go 'to' the store)
UH   interjection (errrrrrrrm)
VB   verb, base form (take)
VBD  verb, past tense (took)
VBG  verb, gerund/present participle (taking)
VBN  verb, past participle (taken)
VBP  verb, singular present, non-3rd person (take)
VBZ  verb, 3rd person singular present (takes)
WDT  wh-determiner (which)
WP   wh-pronoun (who, what)
WRB  wh-adverb (where, when)

Chunking
Chunking in NLP is a process of grouping small pieces of information into larger units. The primary use of chunking is making groups of "noun phrases". It is used to add structure to the sentence by applying regular expressions on top of POS tagging. The resulting groups of words are called "chunks". There are no pre-defined rules for chunking; we can define them according to our needs. Thus, if we want to chunk only 'NN' tags, we need the pattern `mychunk: {<NN>}`, but if we want to chunk all tag types that start with 'NN', we use `mychunk: {<NN.*>}`.

!pip install svgling
from nltk import RegexpParser
from nltk.draw.tree import TreeView
from IPython.display import Image
import svgling

# chunk all adjacent nouns
patterns = """mychunk: {<NN.*>+}"""
chunker = RegexpParser(patterns)
output = chunker.parse(sent)
print("After Chunking", output)
svgling.draw_tree(output)

Named Entity Recognition
Named entity recognition (NER) is one of the most popular data preprocessing tasks. At its core, NER is just a two-step process:
Detecting the entities in the text
Classifying them into different categories

Some of the most important entity categories in NER are:
Person
Organization
Place/Location
Other common tasks include classifying the following:
Date/time expressions
Numeral measurements (money, percent, weight, etc.)
E-mail addresses

Extracting Named Entities
Recognizing a named entity is a specific kind of chunk extraction that uses entity tags along with chunk tags. Common entity tags include PERSON, LOCATION, and ORGANIZATION. NLTK already provides a pre-trained named entity chunker, which can be used via the ne_chunk() function in the nltk.chunk module.

from nltk.chunk import ne_chunk

def extract_ne(trees, labels):
    ne_list = []
    for tree in trees:                 # iterate over the chunked result
        if hasattr(tree, 'label'):     # plain (word, tag) tuples have no label()
            if tree.label() in labels:
                ne_list.append(tree)
    return ne_list

ne_res = ne_chunk(pos_tag(word_tokenize(ex)))
labels = ['ORGANIZATION']
print(extract_ne(ne_res, labels))

Output:
[Tree('ORGANIZATION', [('Reunion', 'NNP'), ('Island', 'NNP')])]

Ambiguity in NE
For a person, the category definition is intuitively quite clear, but for computers there is some ambiguity in classification. Let's look at some ambiguous examples:
England (Organisation) won the 2019 world cup vs. The 2019 world cup happened in England (Location).
Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).

WordNet
WordNet is a lexical database for the English language and is part of the NLTK corpus. You can use the WordNet NLTK module to find the meanings of words, synonyms, antonyms, and more.

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

WordNet is a dictionary designed for programmatic access by natural language processing systems. It has many different use cases, including:
- Looking up the definition of a word
- Finding synonyms and antonyms
- Exploring word relations and similarity
- Word sense disambiguation for words that have multiple uses and definitions
NLTK includes a WordNet corpus reader, which we will use to access and explore WordNet. A corpus is just a body of text, and corpus readers are designed to make accessing a corpus much easier than direct file access.
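
A minimal sketch of the first two use cases with the NLTK WordNet corpus reader; the word "good" is just an illustrative choice:

from nltk.corpus import wordnet

# look up the senses (synsets) of a word and print the first definition
syns = wordnet.synsets("good")
print(syns[0].name())
print(syns[0].definition())

# collect synonyms and antonyms across all senses
synonyms, antonyms = [], []
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        for ant in lemma.antonyms():
            antonyms.append(ant.name())
print(set(synonyms))
print(set(antonyms))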