BoW In AIML So we can import Devam Rana.pptx

devamrana27 · 16 slides · Oct 14, 2024

Slide Content

Bag of Words (BoW)
Prepared by: pk pathare

What is a Bag-of-Words? A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
1. A vocabulary of known words.
2. A measure of the presence of known words.
It is called a bag-of-words because any information about the order or structure of the words in the document is discarded. The model is only concerned with whether known words occur in the document, not where they occur. The complexity lies both in deciding how to design the vocabulary of known words (or tokens) and in how to score the presence of known words.

Example: Below is a snippet of the first few lines of text from the book A Tale of Two Cities by Charles Dickens:

It was the best of times
it was the worst of times
it was the age of wisdom
it was the age of foolishness
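
To make the two steps concrete before moving to scikit-learn, here is a minimal hand-rolled sketch (not from the original slides) that builds the vocabulary and a 0/1 presence vector for the four lines above:

docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Step 1: the vocabulary of known words (lower-cased, whitespace tokenization)
vocab = sorted({word for doc in docs for word in doc.lower().split()})

# Step 2: one 0/1 presence vector per document, one position per vocabulary word
vectors = [[1 if word in doc.lower().split() else 0 for word in vocab] for doc in docs]

print(vocab)       # ['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']
print(vectors[0])  # [0, 1, 0, 1, 1, 1, 1, 1, 0, 0]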

Design the Vocabulary: Make a list of all of the words in the model vocabulary. The CountVectorizer class provides a simple way to tokenize a collection of text documents and build a vocabulary of known words: create an instance of the CountVectorizer class, then call the fit() function to learn a vocabulary from one or more documents.

from sklearn.feature_extraction.text import CountVectorizer

# Multiple documents
text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(sorted(vectorizer.vocabulary_))

Output:
['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']

That is a vocabulary of 10 words from a corpus containing 24 words.

Create Document Vectors: Document Vectors with CountVectorizer. The next step is to score the words in each document. Because the vocabulary has 10 words, we can use a fixed-length document representation of length 10, with one position in the vector to score each word. The simplest scoring method is to mark the presence of words as a boolean value: 0 for absent, 1 for present. Call the transform() function on one or more documents as needed to encode each as a vector.

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

Output:
(4, 10)
[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]

The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored, and no count is given for them in the resulting vector.

# encode another document
text2 = ["the the the times"]
vector = vectorizer.transform(text2)
print(vector.toarray())

Output:
[[0 0 0 0 0 3 1 0 0 0]]
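
Note that CountVectorizer scores raw occurrence counts by default (hence the 3 for "the" above). The boolean presence scoring described earlier can be obtained with its binary=True option; a short illustrative sketch using the same four documents:

from sklearn.feature_extraction.text import CountVectorizer

text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

# binary=True records presence/absence (0 or 1) instead of raw counts
binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit(text)
print(binary_vectorizer.transform(["the the the times"]).toarray())
# [[0 0 0 0 0 1 1 0 0 0]]  -- "the" is marked once rather than counted as 3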

Document Vectors with TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]

# create the transform
vectorizer = TfidfVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(sorted(vectorizer.vocabulary_))

# encode document
vector = vectorizer.transform([text[0]])

Output:
['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']

print(vectorizer.idf_)

Output:
[1.51082562 1.91629073 1.91629073 1.         1.         1.
 1.51082562 1.         1.91629073 1.91629073]

A vocabulary of 10 words is learned from the documents, and each word is assigned a unique integer index in the output vector. The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed words: "it", "of", "the", "was".
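
These values can be checked by hand. With scikit-learn's default smooth_idf=True setting, the inverse document frequency of a term t over n documents is idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) is the number of documents containing t:

import math

n = 4  # number of documents in the corpus
for df in (4, 2, 1):  # document frequency of e.g. "the", "times", "wisdom"
    print(df, math.log((1 + n) / (1 + df)) + 1)
# df=4 -> 1.0         ("it", "of", "the", "was" appear in every document)
# df=2 -> 1.5108...   ("age", "times" appear in two documents)
# df=1 -> 1.9163...   ("best", "foolishness", "wisdom", "worst" appear in one)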

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

Output:
(1, 10)
[[0.         0.60735961 0.         0.31694544 0.31694544 0.31694544
  0.4788493  0.31694544 0.         0.        ]]

The scores are normalized to values between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.
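
Under TfidfVectorizer's default norm='l2' setting, "normalized" means each document vector is scaled to unit Euclidean length, so every entry falls between 0 and 1. A quick check on the row printed above:

import numpy as np

# the encoded first document, copied from the output above
row = np.array([0., 0.60735961, 0., 0.31694544, 0.31694544,
                0.31694544, 0.4788493, 0.31694544, 0., 0.])
print(np.linalg.norm(row))  # ~1.0: each row is scaled to unit L2 length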

Challenges with BoW
If new sentences contain new words, the vocabulary size increases and so does the length of the vectors.
The vectors also contain many zeros, resulting in a sparse matrix (which is what we would like to avoid).
We retain no information about the grammar of the sentences or the ordering of the words in the text.
The basic BoW model does not consider a word's meaning in the document: it completely ignores the context in which the word is used, even though the same word may mean different things depending on the context or nearby words.
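
The sparsity problem is easy to see in practice: transform() returns a scipy.sparse matrix, and as the corpus grows most entries in every document vector are zero. A small illustrative sketch (corpus invented for this example, not from the slides):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["It was the best of times",
          "Call me Ishmael",
          "All happy families are alike"]
X = CountVectorizer().fit_transform(corpus)

print(type(X))          # a scipy.sparse matrix (CSR format), not a dense array
print(X.shape, X.nnz)   # 3 documents x 14 vocabulary words, only 14 nonzero entries

Every new sentence with unseen words adds a position to every document vector while filling in only a handful of nonzero entries.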

Challenges with TF-IDF
TF-IDF is based on the bag-of-words (BoW) model, so it does not capture word position in the text, semantics, co-occurrences across documents, and so on. For this reason, TF-IDF is only useful as a lexical-level feature.
It cannot capture semantics (e.g. compared to topic models or word embeddings).
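
One way to see the lexical-level limitation: two sentences with the same meaning but no shared content words receive a TF-IDF cosine similarity of exactly zero. A brief illustrative sketch (example sentences invented, not from the slides):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the movie was great", "the film was excellent"]

# stop_words='english' drops function words so only content words remain
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# The two sentences mean the same thing but share no content words,
# so their TF-IDF cosine similarity is exactly zero.
print(cosine_similarity(X[0], X[1]))  # [[0.]]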