Text Analysis Operations using NLTK.pptx


Slide Content

Text Analysis Operations using NLTK
Prepared by: pk pathare

NLTK?
NLTK (Natural Language Toolkit) is a powerful Python package that provides a diverse set of natural language processing algorithms. It is free and open source, easy to use, well documented, and backed by a large community. NLTK covers the most common tasks such as tokenizing, part-of-speech tagging, stemming, sentiment analysis, topic segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and understand written text.
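
The slides that follow assume NLTK is installed and a few standard data packages are downloaded. A minimal setup sketch; the resource names below are the usual NLTK downloads for these examples, though exact names can differ across NLTK versions:

# pip install nltk
import nltk

# download the corpora/models used in the following slides
nltk.download('punkt')                       # sentence/word tokenizers
nltk.download('stopwords')                   # stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the NE chunker
nltk.download('wordnet')                     # WordNet corpus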

Tokenization
Tokenization is the first step in text analytics. The process of breaking down a text paragraph into smaller chunks such as words or sentences is called tokenization. A token is a single entity that serves as a building block of a sentence or paragraph.

Sentence Tokenization
A sentence tokenizer breaks a text paragraph into sentences.

import nltk
from nltk.tokenize import sent_tokenize

text ="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard ""“ tokenized_text = sent_tokenize (text) print ( tokenized_text )

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]

Word Tokenization
A word tokenizer breaks a text paragraph into words.

from nltk.tokenize import word_tokenize
tokenized_word = word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']

Frequency Distribution

from nltk.probability import FreqDist
fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 25 samples and 30 outcomes>

fdist.most_common(2)

[('is', 3), (',', 2)]
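
The frequency distribution can also be visualized. A minimal sketch, assuming matplotlib is installed (FreqDist.plot draws the most frequent tokens):

import matplotlib.pyplot as plt
fdist.plot(10, cumulative=False)  # line plot of the 10 most frequent tokens
plt.show()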

Stopwords
Stopwords are considered noise in the text. Text may contain stop words such as is, am, are, this, a, an, the, etc. To remove stopwords in NLTK, you need to create a set of stopwords and filter your list of tokens against it.

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
print(stop_words)

{'before', 'then', 'ain', 'weren', 'that', "mustn't", 'these', "don't", "you'd", 'about', 'between', 'them', 'under', 'at', 'to', "you're", 'once', 'here', 'during', 'him', "you'll", 'yourselves', 'we', 's', 'mustn', "it's", 'didn', 'up', 'myself', "she's", 'being', 'am', 'o', 't', 'you', 'where', 'it', 'too', 'while', 'when', 'very', 'hadn', 've', "mightn't", 'further', 'wouldn', 'is', 'are', 'yourself', 'doing', 'had', 'but', 'can', "wouldn't", 'doesn', "doesn't", 'ourselves', 'hasn', 'from', 'through', 'for', 'of', 'whom', 'such', 'shouldn', 'with', 'an', 'and', "didn't", 'those', 'again', "needn't", "aren't", "weren't", 'herself', 'be', 'll', 'just', 'own', "shan't", "that'll", 'do', 'over', 'don', "couldn't", 'more', 'other', 'not', 'if', 'most', 'why', 'so', "won't", 'how', 'were', 'its', 'itself', 'a', 'was', "wasn't", 'themselves', 'haven', 'he', 'out', 'my', 'into', 'ours', 'both', 'couldn', "shouldn't", 'which', 'been', 'there', 'same', 'few', 'off', 'y', 'has', 'isn', 'no', 'd', 're', 'needn', 'as', 'by', 'shan', "you've", 'have', 'some', 'having', 'm', 'his', 'they', "isn't", 'himself', 'i', 'each', 'hers', 'nor', 'only', "haven't", 'what', 'now', 'down', 'above', 'she', 'should', "should've", 'wasn', 'on', 'the', 'me', 'after', 'their', 'did', 'or', 'theirs', 'won', 'our', 'mightn', 'who', "hadn't", 'in', 'than', 'yours', 'aren', 'until', 'does', 'against', 'all', 'this', 'any', "hasn't", 'her', 'your', 'below', 'will', 'ma', 'because'}

Removing Stopwords

filtered_sent = []
for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokenized_word)
print("Filtered Sentence:", filtered_sent)

Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']

Lexicon Normalization
Lexicon normalization addresses another type of noise in text. For example, connection, connected, and connecting all reduce to a common word, "connect". It reduces derivationally related forms of a word to a common root word.

Stemming
Stemming is a process of linguistic normalization which reduces words to their root word, or chops off derivational affixes. For example, connection, connected, and connecting reduce to a common word, "connect".

Consider the words: learn, learning, learned, learnt. All these words are stemmed to the common root learn. However, in some cases the stemming process produces stems that are not correct spellings of the root word, for example happi and sunni. That is because it chooses the most common stem for related words. For example, look at the set of words that comprises the different forms of happy: happy, happiness, happier. We can see that the stem happi is more commonly used. We cannot choose happ because it is the stem of unrelated words like happen.
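
A quick way to check some of these stems with NLTK's PorterStemmer; a minimal sketch using a few of the example words from this slide:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
for w in ["connection", "connected", "connecting"]:
    print(w, "->", ps.stem(w))   # each reduces to 'connect'
for w in ["happy", "happiness"]:
    print(w, "->", ps.stem(w))   # each reduces to 'happi'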

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()
stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))
print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'The', 'weather', 'great', ',', 'city', 'awesome', '.', 'The', 'sky', 'pinkish-blue', '.', 'You', "n't", 'eat', 'cardboard']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?', 'the', 'weather', 'great', ',', 'citi', 'awesom', '.', 'the', 'sky', 'pinkish-blu', '.', 'you', "n't", 'eat', 'cardboard']

Lemmatization
Lemmatization reduces words to their base word, which is a linguistically correct lemma. It transforms a word to its root form using vocabulary and morphological analysis. Lemmatization is usually more sophisticated than stemming: a stemmer works on an individual word without knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming misses this, because it requires a dictionary look-up.

#Lexicon Normalization
#performing stemming and lemmatization
from nltk.stem.wordnet import WordNetLemmatizer
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:", lem.lemmatize(word, "v"))  # "v" = treat the word as a verb
print("Stemmed Word:", stem.stem(word))

Output:
Lemmatized Word: fly
Stemmed Word: fli
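
The "better"/"good" example from the previous slide can be checked the same way. A minimal sketch reusing lem and stem from above; the lemmatizer needs the part of speech "a" (adjective), otherwise it returns the word unchanged:

print("Lemmatized Word:", lem.lemmatize("better", pos="a"))  # good
print("Stemmed Word:", stem.stem("better"))                  # better (the stemmer cannot recover "good")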

POS Tagging
The primary target of Part-of-Speech (POS) tagging is to identify the grammatical group of a given word: whether it is a noun, pronoun, adjective, verb, adverb, etc., based on the context. POS tagging looks for relationships within the sentence and assigns a corresponding tag to each word.

import numpy as np
import pandas as pd
import os
from nltk import word_tokenize, pos_tag

# os.listdir('../input/nlp-getting-started/train.csv')
# read data from nlp-getting-started
nlp_start_df = pd.read_csv('../input/nlp-getting-started/train.csv')

# take one example sentence
ex = nlp_start_df.loc[159]['text']

# tokenize the sentence and apply POS tagging
sent = pos_tag(word_tokenize(ex))
sent

[('Experts', 'NNS'), ('in', 'IN'), ('France', 'NNP'), ('begin', 'VB'), ('examining', 'VBG'), ('airplane', 'JJ'), ('debris', 'NN'), ('found', 'VBD'), ('on', 'IN'), ('Reunion', 'NNP'), ('Island', 'NNP'), (':', ':'), ('French', 'JJ'), ('air', 'NN'), ('accident', 'NN'), ('experts', 'NNS'), ('on', 'IN'), ('Wedn', 'NNP'), ('...', ':'), ('http', 'NN'), (':', ':'), ('//t.co/v4SMAESLK5', 'NN')]
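
The Kaggle train.csv above is only needed to get an example tweet. If that file is not available, pos_tag can be applied to any tokenized sentence, for example the text from the earlier slides; a sketch (the tags will of course differ from the tweet output above):

from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("Hello Mr. Smith, how are you doing today?")))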

CC   coordinating conjunction
CD   cardinal digit
DT   determiner
EX   existential there (like "there is" ... think of it like "there exists")
FW   foreign word
IN   preposition/subordinating conjunction
JJ   adjective ('big')
JJR  adjective, comparative ('bigger')
JJS  adjective, superlative ('biggest')
LS   list marker (1))
MD   modal (could, will)
NN   noun, singular ('desk')
NNS  noun, plural ('desks')
NNP  proper noun, singular ('Harrison')
NNPS proper noun, plural ('Americans')
PDT  predeterminer ('all the kids')
POS  possessive ending (parent's)
PRP  personal pronoun (I, he, she)
RB   adverb (very, silently)
RBR  adverb, comparative (better)
RBS  adverb, superlative (best)
RP   particle (give up)
TO   to (go 'to' the store)
UH   interjection (errrrrrrrm)
VB   verb, base form (take)
VBD  verb, past tense (took)
VBG  verb, gerund/present participle (taking)
VBN  verb, past participle (taken)
VBP  verb, singular present, non-3rd person (take)
VBZ  verb, 3rd person singular present (takes)
WDT  wh-determiner (which)
WP   wh-pronoun (who, what)
WRB  wh-adverb (where, when)

Chunking
Chunking in NLP is a process of grouping small pieces of information into larger units. The primary use of chunking is making groups of "noun phrases". It is used to add structure to the sentence by applying regular expressions on top of POS tagging. The resulting groups of words are called "chunks". There are no pre-defined rules for chunking; we can define them according to our needs. Thus, if we want to chunk only 'NN' tags, we need the pattern `mychunk: {<NN>}`, but if we want to chunk all tag types that start with 'NN', we use `mychunk: {<NN.*>}`.

!pip install svgling
from nltk import RegexpParser
from nltk.draw.tree import TreeView
from IPython.display import Image
import svgling

# chunk all adjacent nouns
patterns = """mychunk: {<NN.*>+}"""
chunker = RegexpParser(patterns)
output = chunker.parse(sent)
print("After Chunking", output)
svgling.draw_tree(output)

Named Entity Recognition
Named entity recognition (NER) is one of the most popular data preprocessing tasks. At its core, NER is just a two-step process:
Detecting the entities in the text
Classifying them into different categories

Some of the most important entity categories in NER are:
Person
Organization
Place/Location
Other common tasks include classifying the following:
Date/time expressions
Numeral measurements (money, percent, weight, etc.)
E-mail addresses

Extracting Named Entities
Recognizing a named entity is a specific kind of chunk extraction that uses entity tags along with chunk tags. Common entity tags include PERSON, LOCATION, and ORGANIZATION. NLTK already provides a pre-trained named entity chunker, which can be used via the ne_chunk() function in the nltk.chunk module.

from nltk.chunk import ne_chunk

def extract_ne(trees, labels):
    ne_list = []
    for tree in trees:                 # iterate over the chunked result
        if hasattr(tree, 'label'):     # plain (word, tag) tuples have no label()
            if tree.label() in labels:
                ne_list.append(tree)
    return ne_list

ne_res = ne_chunk(pos_tag(word_tokenize(ex)))
labels = ['ORGANIZATION']
print(extract_ne(ne_res, labels))

Output:
[Tree('ORGANIZATION', [('Reunion', 'NNP'), ('Island', 'NNP')])]

Ambiguity in NE
For a person, the category definition is intuitively quite clear, but for computers there is some ambiguity in classification. Let's look at some ambiguous examples:
England (Organisation) won the 2019 world cup vs. The 2019 world cup happened in England (Location).
Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).

WordNet
WordNet is a lexical database for the English language and is part of the NLTK corpus. You can use the WordNet NLTK module to find the meanings of words, synonyms, antonyms, and more.

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

WordNet is a dictionary designed for programmatic access by natural language processing systems. It has many different use cases, including:
- Looking up the definition of a word
- Finding synonyms and antonyms
- Exploring word relations and similarity
- Word sense disambiguation for words that have multiple uses and definitions
NLTK includes a WordNet corpus reader, which we will use to access and explore WordNet. A corpus is just a body of text, and corpus readers are designed to make accessing a corpus much easier than direct file access.
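
A minimal sketch of the first two use cases with the NLTK WordNet corpus reader; the word "good" is just an illustrative choice:

from nltk.corpus import wordnet

# look up the senses (synsets) of a word and print the first definition
syns = wordnet.synsets("good")
print(syns[0].name())
print(syns[0].definition())

# collect synonyms and antonyms across all senses
synonyms, antonyms = [], []
for syn in wordnet.synsets("good"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        for ant in lemma.antonyms():
            antonyms.append(ant.name())
print(set(synonyms))
print(set(antonyms))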