2. Vector Space Model
• The vector space model can be used for search
engines and document retrieval systems.
• Given a set of documents and a search query (a
set of search terms), we need to retrieve the
relevant documents, i.e. those that are similar to
the search query.
[Figure: documents, search query, relevant documents]
3. Steps of Vector Space Model
• A vector space model is an algebraic model
involving two steps:
• In the first step we represent each text
document as a vector of words.
• In the second step we transform these vectors
into a numerical format so that we can apply
text mining techniques such as information
retrieval, information extraction, information
filtering etc.
4. Example of vector space model
• Let us understand with an example. Consider
the statements below:
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
5. Document vectors representation
• The first step includes breaking each document
into words and applying preprocessing steps such
as removing stopwords, punctuation, special
characters etc.
document 1: (good, boy)
document 2: (good, girl)
document 3: (good, boy, girl)
• The next step is to represent the term vectors
created above in a numerical format known as the
term document matrix.
6. Term Document Matrix
• A term document matrix is a way of representing
document vectors in a matrix format in which
each row represents a term vector across all the
documents and each column represents a document
vector across all the terms.
• The cell value is the frequency count of the term
in the corresponding document. In the simplest
(binary) form, the cell contains 1 if the term is
present in the document and 0 if it is not.
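As a rough illustration of the term document matrix described above, the sketch below builds one for the three example documents (good boy, good girl, boy girl good). The names `docs`, `vocab` and `term_doc` are hypothetical, and the documents are assumed to be already tokenized and preprocessed.

```python
from collections import Counter

# The three example documents, assumed already tokenized and preprocessed.
docs = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["boy", "girl", "good"],
}

# Vocabulary in first-seen order: one matrix row per term.
vocab = []
for tokens in docs.values():
    for term in tokens:
        if term not in vocab:
            vocab.append(term)

# Count how often each term occurs in each document.
counts = {name: Counter(tokens) for name, tokens in docs.items()}

# Rows are terms, columns are documents (doc1, doc2, doc3).
term_doc = {term: [counts[name][term] for name in docs] for term in vocab}

for term, row in term_doc.items():
    print(f"{term:5s} {row}")
```

Because no term repeats within a document here, the frequency counts and the binary 0/1 values happen to coincide.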
7. TF*IDF
• We should note that a word that occurs in most
of the documents contributes little to
distinguishing which documents are relevant.
• Less frequent terms, on the other hand, are more
likely to define document relevance.
• This weighting can be achieved using a method
known as term frequency – inverse document
frequency (tf-idf).
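8. TF*IDF
• First we calculate the term frequency (TF):
• tf = (no. of times the term occurs in a document) /
(no. of words in the document)
• Second we calculate the inverse document
frequency (IDF):
• idf = log(no. of documents / no. of documents
containing the term)
• tf-idf = tf * idf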
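The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the natural logarithm is an assumption (the slides do not state a base), and `idf` assumes the term occurs in at least one document.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of the term in the document / words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(number of documents / number of documents containing the term).

    Assumes the term appears in at least one document; natural log is an
    assumption, since the base is not specified in the slides.
    """
    n_containing = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / n_containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

# Usage with the three example documents.
docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
print(round(tf_idf("boy", docs[0], docs), 4))   # 0.5 * ln(3/2), about 0.2027
print(tf_idf("good", docs[0], docs))            # 0.0, "good" is in every document
```

A term that appears in every document gets idf = log(1) = 0, so its tf-idf weight is zero regardless of how often it occurs.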
9. Term frequency for the example
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
• tf = (no. of times the term occurs in a document) /
(no. of words in the document)

Term    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3
10. Inverse document frequency for the example
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
• idf = log(no. of documents / no. of documents
containing the term)

Term    IDF
good    log(3/3)
boy     log(3/2)
girl    log(3/2)
11. TF-IDF for the example
• tf-idf = tf * idf

Term    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3

Term    IDF
good    log(3/3) = 0
boy     log(3/2)
girl    log(3/2)

        good    boy             girl
Doc1    0       1/2*log(3/2)    0
Doc2    0       0               1/2*log(3/2)
Doc3    0       1/3*log(3/2)    1/3*log(3/2)
12. Example 2
• Document 1: A cat runs behind rat
• Document 2: The dog runs behind cat
• Document 3: The bull runs behind the player
query: rat
Doc 1: (cat, runs, behind, rat)
Doc 2: (dog, runs, behind, cat)
Doc 3: (bull, runs, behind, player)
Query: (rat)
• The document most relevant to the query is the one with the maximum
similarity score: Max(similarity(doc 1, Query), similarity(doc 2, Query),
similarity(doc 3, Query))
• The next step is to represent the term vectors created above in a
numerical format (term document matrix).
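One way to carry out the ranking for Example 2 is sketched below. The slides do not say which similarity score is used; cosine similarity over tf-idf vectors is assumed here because it is a common choice for the vector space model. All names (`vectorize`, `cosine`, `scores`) are hypothetical, and natural log is again an assumption.

```python
import math

# Example 2 after stopword removal, as listed above.
docs = {
    "doc1": ["cat", "runs", "behind", "rat"],
    "doc2": ["dog", "runs", "behind", "cat"],
    "doc3": ["bull", "runs", "behind", "player"],
}
query = ["rat"]

# Vocabulary and idf = log(N / number of documents containing the term).
vocab = sorted({term for tokens in docs.values() for term in tokens})
n_docs = len(docs)
idf = {
    term: math.log(n_docs / sum(term in tokens for tokens in docs.values()))
    for term in vocab
}

def vectorize(tokens):
    """tf-idf vector over the shared vocabulary (unknown terms are ignored)."""
    return [tokens.count(term) / len(tokens) * idf[term] for term in vocab]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

q_vec = vectorize(query)
scores = {name: cosine(vectorize(tokens), q_vec) for name, tokens in docs.items()}
best = max(scores, key=scores.get)

print(scores)
print("most relevant:", best)
```

Only doc 1 contains "rat", so it is the only document with a non-zero score under this assumed measure, and it is returned as the most relevant.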
15. Advantages of vector space model
• The vector space model has the following
advantages:
1. Allows ranking documents according to their
possible relevance.
2. Allows retrieving items with partial term
overlap.
16. Limitations
• The vector space model has the following
limitations:
1. Query terms are assumed to be independent,
so phrases might not be represented well in
the ranking.
2. Semantic sensitivity: documents with similar
context but different vocabulary won’t be
associated.
17. Models based on the vector space
model
• Models based on and extending the vector space
model include:
1. Generalized vector space model
2. Latent semantic analysis
3. Term
4. Rocchio Classification
5. Random indexing
6. Search Engine Optimization
18. References
1. Büttcher, Stefan; Clarke, Charles L. A.; Cormack,
Gordon V. (2016). Information retrieval:
implementing and evaluating search engines
(First MIT Press paperback ed.). Cambridge,
Massachusetts London, England: The MIT Press.
ISBN 978-0-262-52887-0.
2. G. Salton , A. Wong , C. S. Yang, A vector space
model for automatic indexing, Communications
of the ACM, v.18 n.11, p.613–620, Nov. 1975
3. https://en.wikipedia.org/wiki/Vector_space_model#cite_ref-:0_1-0