Vector Space Model
The vector space model can be used for search engines and document retrieval systems. Given a set of documents and a search query (one or more search terms), the goal is to retrieve the relevant documents, that is, the ones most similar to the query.
[Figure: a set of documents and a search query as input, relevant documents as output]
Steps of Vector Space Model
The vector space model is an algebraic model that involves two steps. In the first step, each text document is represented as a vector of words. In the second step, these vectors are transformed into a numerical format so that text mining techniques such as information retrieval, information extraction, and information filtering can be applied.
Example of vector space model
Let us understand this with an example. Consider the following documents:
Document 1: good boy
Document 2: good girl
Document 3: boy girl good
Document vectors representation
The first step is to break each document into words and apply preprocessing such as removing stopwords, punctuation, and special characters:
document 1: (good, boy)
document 2: (good, girl)
document 3: (good, boy, girl)
The next step is to represent these term vectors in a numerical format known as the term document matrix.
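The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical names introduced here, and a real system would use a much larger stopword list.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for illustration).
STOPWORDS = {"a", "an", "the", "is", "of"}

def preprocess(text):
    """Lowercase the text, drop punctuation/special characters, remove stopwords."""
    # Keep only runs of letters and digits; everything else acts as a separator.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

documents = ["good boy", "good girl", "boy girl good"]
doc_terms = [preprocess(d) for d in documents]
print(doc_terms)
```

The order of terms inside each vector does not matter for the steps that follow; only which terms occur, and how often, is used.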
Term Document Matrix
A term document matrix represents the document vectors in matrix form: each row is a term vector across all documents, and each column is a document vector across all terms. Each cell holds the frequency count of the term in the corresponding document. In this example every term occurs at most once per document, so a cell contains 1 if the term is present in the document and 0 if it is not.
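As a rough sketch, the term document matrix for the three example documents could be built as follows. The variable names (`documents`, `vocabulary`, `matrix`) are illustrative choices, and the nested dictionary is just one convenient way to hold the matrix.

```python
# Term vectors of the three example documents after preprocessing.
documents = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["good", "boy", "girl"],
}

# Vocabulary in first-seen order: one matrix row per term.
vocabulary = []
for terms in documents.values():
    for term in terms:
        if term not in vocabulary:
            vocabulary.append(term)

# matrix[term][doc] = number of times the term occurs in that document.
matrix = {
    term: {name: terms.count(term) for name, terms in documents.items()}
    for term in vocabulary
}

for term, row in matrix.items():
    print(term, [row[name] for name in documents])
```

Each printed row is a term vector across the documents; reading one position down all rows gives the corresponding document vector.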
TF*IDF
A word that occurs in most of the documents contributes little to document relevance, whereas a term that occurs in only a few documents is more likely to characterize the documents that contain it. This can be captured with a weighting method known as term frequency-inverse document frequency (tf-idf).
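TF*IDF
First, we calculate the term frequency (tf):
tf = (number of occurrences of the term in a document) / (number of words in the document)
Second, we calculate the inverse document frequency (idf):
idf = log(number of documents / number of documents containing the term)
Finally:
tf-idf = tf * idf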
Term frequency for the example (Document 1: good boy; Document 2: good girl; Document 3: boy girl good)
tf = (number of occurrences of the term in a document) / (number of words in the document)

term    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3
Inverse document frequency for the same three documents
idf = log(number of documents / number of documents containing the term)

term    idf
good    log(3/3) = 0
boy     log(3/2)
girl    log(3/2)
tf-idf = tf * idf
Multiplying each tf value by the idf of its term gives:

        good    boy             girl
doc1    0       1/2*log(3/2)    0
doc2    0       0               1/2*log(3/2)
doc3    0       1/3*log(3/2)    1/3*log(3/2)
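The tf-idf values for the good/boy/girl example can be reproduced with a short script. This is a minimal sketch: the function names `tf` and `idf` are illustrative, and the base-10 logarithm is an assumption, since the example leaves the values as log(3/2) without naming a base.

```python
import math

documents = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["boy", "girl", "good"],
}

def tf(term, terms):
    """Occurrences of the term in a document divided by the document length."""
    return terms.count(term) / len(terms)

def idf(term, docs):
    """log10(number of documents / number of documents containing the term)."""
    # Terms are taken from the documents themselves, so df is never zero here.
    df = sum(1 for terms in docs.values() if term in terms)
    return math.log10(len(docs) / df)  # base 10 is an assumption

vocabulary = sorted({t for terms in documents.values() for t in terms})
tfidf = {
    name: {term: tf(term, terms) * idf(term, documents) for term in vocabulary}
    for name, terms in documents.items()
}

for name, weights in tfidf.items():
    print(name, {term: round(w, 4) for term, w in weights.items()})
```

Because "good" appears in every document, its idf is log(3/3) = 0 and its weight vanishes in all three documents, which is exactly the behaviour tf-idf is meant to produce for uninformative terms.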
Example 2
Document 1: A cat runs behind rat
Document 2: The dog runs behind cat
Document 3: The bull runs behind the player
Query: rat

After preprocessing:
Doc 1: (cat, runs, behind, rat)
Doc 2: (dog, runs, behind, cat)
Doc 3: (bull, runs, behind, player)
Query: (rat)

The document most relevant to the query is the one with the highest similarity score:
relevant document = Max(similarity score between (doc 1, Query), similarity score between (doc 2, Query))
The next step is to represent these term vectors in numerical format (term document matrix).
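idf = log(n / df), where df is the document frequency

Words/documents   doc1      doc2      query     df   idf
cat               0         0         0         2    0
runs              0         0         0         2    0
behind            0         0         0         2    0
rat               0.30103   0         0.30103   1    0.30103
dog               0         0.30103   0         1    0.30103

The values correspond to a base-10 logarithm with n = 2 (doc 1 and doc 2): log(2/1) = 0.30103 and log(2/2) = 0.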
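With the weights from the table, the query can be scored against each document. The sketch below uses cosine similarity, which is an assumption: the slides only speak of a "similarity score" without naming the measure. It also assumes that blank table cells are zero. The `cosine` helper and variable names are illustrative.

```python
import math

# Term order: cat, runs, behind, rat, dog (weights taken from the table).
vectors = {
    "doc1": [0.0, 0.0, 0.0, 0.30103, 0.0],
    "doc2": [0.0, 0.0, 0.0, 0.0, 0.30103],
}
query = [0.0, 0.0, 0.0, 0.30103, 0.0]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

scores = {name: cosine(vec, query) for name, vec in vectors.items()}
best = max(scores, key=scores.get)  # document with the highest similarity
print(scores, best)
```

Doc 1 shares the term "rat" with the query while doc 2 shares no weighted term with it, so doc 1 is ranked first.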
Advantages of vector space model
The vector space model has the following advantages:
Allows ranking documents according to their possible relevance.
Allows retrieving items with partial term overlap.
Limitations
The vector space model has the following limitations:
Query terms are assumed to be independent, so phrases might not be represented well in the ranking.
Semantic sensitivity: documents with similar context but different vocabulary won't be associated.
Models based on the vector space model
Models based on and extending the vector space model include:
Generalized vector space model
Latent semantic analysis
Rocchio classification
Random indexing
Search Engine Optimization
References
Büttcher, Stefan; Clarke, Charles L. A.; Cormack, Gordon V. (2016). Information retrieval: implementing and evaluating search engines (First MIT Press paperback ed.). Cambridge, Massachusetts; London, England: The MIT Press. ISBN 978-0-262-52887-0.
Salton, G.; Wong, A.; Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620. https://en.wikipedia.org/wiki/Vector_space_model#cite_ref-:0_1-0