What is a Bag-of-Words?

A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:

1. A vocabulary of known words.
2. A measure of the presence of known words.

It is called a bag-of-words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document they occur. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.
Example: Below is a snippet of the first few lines of text from the book A Tale of Two Cities by Charles Dickens:

It was the best of times
it was the worst of times
it was the age of wisdom
it was the age of foolishness
Design the Vocabulary

Make a list of all of the words in the model vocabulary. The CountVectorizer class provides a simple way to tokenize a collection of text documents and build a vocabulary of known words. Create an instance of the CountVectorizer class, then call the fit() function to learn the vocabulary from one or more documents.
from sklearn.feature_extraction.text import CountVectorizer

# multiple documents
text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize the learned vocabulary
print(sorted(vectorizer.vocabulary_))
Output:

['age', 'best', 'foolishness', 'it', 'of', 'the', 'times', 'was', 'wisdom', 'worst']

That is a vocabulary of 10 words from a corpus containing 24 words.
Create Document Vectors

Document Vectors with CountVectorizer

The next step is to score the words in each document. Because we know the vocabulary has 10 words, we can use a fixed-length document representation of length 10, with one position in the vector to score each word. The simplest scoring method is to mark the presence of words as a boolean value: 0 for absent, 1 for present. (CountVectorizer actually counts occurrences, so a repeated word scores higher than 1, as the out-of-vocabulary example further below shows.) Call the transform() function on one or more documents as needed to encode each as a vector, as sketched below.
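As a minimal sketch, continuing with the vectorizer and text list fitted above, encoding the four original documents produces one count vector per document:

# encode the documents used to fit the vocabulary
vector = vectorizer.transform(text)
print(vector.shape)
print(vector.toarray())

This should print something like:

(4, 10)
[[0 1 0 1 1 1 1 1 0 0]
 [0 0 0 1 1 1 1 1 0 1]
 [1 0 0 1 1 1 0 1 1 0]
 [1 0 1 1 1 1 0 1 0 0]]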
The same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.

# encode another document
text2 = ["the the the times"]
vector = vectorizer.transform(text2)
print(vector.toarray())

Output:

[[0 0 0 0 0 3 1 0 0 0]]
Document Vectors with TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["It was the best of times",
        "it was the worst of times",
        "it was the age of wisdom",
        "it was the age of foolishness"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize the learned vocabulary
print(sorted(vectorizer.vocabulary_))
# encode the first document
vector = vectorizer.transform([text[0]])
print(vectorizer.idf_)

Output:

[1.51082562 1.91629073 1.91629073 1. 1. 1. 1.51082562 1. 1.91629073 1.91629073]

A vocabulary of 10 words is learned from the documents and each word is assigned a unique integer index in the output vector. The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed words: "it", "of", "the", "was".
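As a minimal sketch of how these values arise, assuming scikit-learn's default smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t:

import numpy as np

n = 4  # number of documents in the corpus
# document frequencies for three representative words
df = {"it": 4, "times": 2, "wisdom": 1}
for word, df_t in df.items():
    idf = np.log((1 + n) / (1 + df_t)) + 1
    print(word, idf)

This reproduces the three distinct values above: 1.0 for "it", roughly 1.5108 for "times", and roughly 1.9163 for "wisdom".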
# summarize the encoded vector
print(vector.shape)
print(vector.toarray())

Output:

(1, 10)
[[0. 0.60735961 0. 0.31694544 0.31694544 0.31694544 0.4788493 0.31694544 0. 0. ]]

The scores are normalized to values between 0 and 1 (each document vector is scaled to unit Euclidean length), and the encoded document vectors can then be used directly with most machine learning algorithms.
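As a check, a minimal sketch reproducing the first document's scores by hand, reusing the vectorizer fitted above: multiply the term counts by the learned IDF weights, then divide by the vector's Euclidean norm.

import numpy as np

# term counts for "It was the best of times" in vocabulary order
counts = np.array([0, 1, 0, 1, 1, 1, 1, 1, 0, 0])
raw = counts * vectorizer.idf_    # tf * idf
print(raw / np.linalg.norm(raw))  # matches vector.toarray()[0]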
Challenges with BoW

1. If new sentences contain new words, the vocabulary size increases and, thereby, the length of the vectors increases too. The vectors also contain many 0s, resulting in a sparse matrix (which is what we would like to avoid).
2. We retain no information on the grammar of the sentences nor on the ordering of the words in the text, as the sketch after this list demonstrates.
3. The basic BoW model doesn't consider a word's meaning in the document. It completely ignores the context in which the word is used, even though we might use the same word in different places with different meanings depending on the context or nearby words.
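To make the ordering point concrete, here is a minimal sketch (the example sentences are hypothetical) showing that two sentences with opposite meanings receive identical count vectors:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["man bites dog", "dog bites man"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))  # ['bites', 'dog', 'man']
print(counts.toarray())                # both rows are [1 1 1]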
Challenges with tf-idf

1. TF-IDF is based on the bag-of-words (BoW) model, therefore it does not capture position in text, semantics, co-occurrences across different documents, etc. For this reason, TF-IDF is useful only as a lexical-level feature.
2. It cannot capture semantics (e.g., as compared to topic models or word embeddings); see the sketch below.
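A minimal sketch of the semantics limitation, using hypothetical synonymous sentences: the two documents mean the same thing, but because they share few exact tokens, their TF-IDF vectors end up with low cosine similarity.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the car is quick", "the automobile is fast"]
tfidf = TfidfVectorizer().fit_transform(docs)
# low similarity despite identical meaning; an embedding-based
# representation would score these sentences as highly similar
print(cosine_similarity(tfidf[0], tfidf[1]))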