Vector space model12345678910111213.pptx

Vector Space Model
Gamela Nageh

Vector Space Model
• The vector space model can be used for search engines
and document retrieval systems.
• Given a set of documents and a search query,
we need to retrieve the relevant documents that are
similar to the query.
[Figure: Documents, Search query, Relevant documents]

Steps of Vector Space Model
• A vector space model is an algebraic model
involving two steps:
• In the first step, we represent each text document
as a vector of words.
• In the second step, we transform these vectors into a
numerical format so that we can apply text mining
techniques such as information retrieval,
information extraction, information filtering,
etc.

Example of vector space model
• Let us understand with an example. Consider
the statements below:
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good

Document vectors representation
• The first step includes breaking each
document into words and applying preprocessing
steps such as removing stopwords, punctuation,
special characters, etc.
document 1: (good, boy)
document 2: (good, girl)
document 3: (good, boy, girl)
• The next step is to represent the term vectors
created above in a numerical format known as a
term-document matrix.

Term Document Matrix.
• A term-document matrix is a way of representing
document vectors in matrix format, in which
each row represents a term vector across all the
documents and each column represents a document
vector across all the terms.
• Each cell value is the frequency count of the term in the
corresponding document. In this example every term
occurs at most once per document, so the cell value
is 1 if the term is present in the document and 0 if
it is not.
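
The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names build_vocabulary and term_document_matrix are hypothetical (they do not come from the slides), and the vocabulary order is simply the order of first appearance.

```python
# Documents from the example, already split into terms.
docs = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["good", "boy", "girl"],
}


def build_vocabulary(docs):
    """Collect distinct terms in order of first appearance (hypothetical helper)."""
    vocab = []
    for terms in docs.values():
        for term in terms:
            if term not in vocab:
                vocab.append(term)
    return vocab


def term_document_matrix(docs, vocab):
    """Rows are terms, columns are documents; a cell is 1 if the term occurs, else 0."""
    return {
        term: [1 if term in terms else 0 for terms in docs.values()]
        for term in vocab
    }


vocab = build_vocabulary(docs)
matrix = term_document_matrix(docs, vocab)
for term in vocab:
    print(term, matrix[term])
```

Each printed row is one term vector; reading the same position across all rows gives one document vector.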

TF*IDF
• We should note that a word that occurs in most of
the documents might not contribute much to
representing document relevance.
• Less frequently occurring terms, by contrast, might
define document relevance.
• This can be achieved using a method known as
term frequency-inverse document frequency
(tf-idf).

TF*IDF
• First we calculate the term frequency (tf):
• tf = (no. of occurrences of the term in a doc) / (no. of words in the doc)
• Second we calculate the inverse document frequency (idf):
• idf = log(no. of documents / no. of documents containing the term)
• tf-idf = tf * idf

• tf = (no. of occurrences of the term in a doc) / (no. of words in the doc)
Term    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3

• idf = log(no. of documents / no. of documents containing the term)
Word    idf
good    log(3/3)
boy     log(3/2)
girl    log(3/2)

• tf-idf = tf * idf
Since idf(good) = log(3/3) = 0, the term "good" gets a weight of 0 in every document.
        good    boy             girl
Doc1    0       1/2*log(3/2)    0
Doc2    0       0               1/2*log(3/2)
Doc3    0       1/3*log(3/2)    1/3*log(3/2)
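
As a rough check of the tf-idf values above, the same calculation can be sketched in Python. The function names tf, idf and tf_idf are illustrative, and the base-10 logarithm is an assumption: the slides do not state a base for this example, and a different base only rescales the weights.

```python
import math

# Documents from the example, already split into terms.
docs = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["good", "boy", "girl"],
}


def tf(term, terms):
    """Occurrences of the term in the document divided by the document length."""
    return terms.count(term) / len(terms)


def idf(term, docs):
    """log(number of documents / number of documents containing the term)."""
    n_containing = sum(1 for terms in docs.values() if term in terms)
    return math.log10(len(docs) / n_containing)  # base 10 is an assumption


def tf_idf(docs):
    """Return {document name: {term: tf * idf}} over the whole vocabulary."""
    vocab = sorted({term for terms in docs.values() for term in terms})
    return {
        name: {term: tf(term, terms) * idf(term, docs) for term in vocab}
        for name, terms in docs.items()
    }


weights = tf_idf(docs)
for name, row in weights.items():
    print(name, row)
```

The term "good" appears in all three documents, so its idf is log(3/3) = 0 and it drops out of every document vector.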

Example 2
• Document 1: A cat runs behind rat
• Document 2: The dog runs behind cat
• Document 3: The bull runs behind the player
query: rat
Doc 1: (cat, runs, behind, rat)
Doc 2: (dog, runs, behind, cat)
Doc 3: (bull, runs, behind, player)
Query: (rat)
• The relevant document for the query = max(similarity score between (doc 1,
query), similarity score between (doc 2, query)). The worked example that follows scores Documents 1 and 2 only.
• The next step is to represent the term vectors created above in a numerical
format (term-document matrix).
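
The preprocessing that turns the raw sentences into the term lists above can be sketched as follows. This is a minimal sketch: the stopword set contains only "a" and "the" because those are the words dropped in this example, and the name preprocess is hypothetical.

```python
import string

# Assumption: only the stopwords that are actually removed in this example.
STOPWORDS = {"a", "the"}


def preprocess(text):
    """Lowercase the text, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOPWORDS]


raw_docs = {
    "doc1": "A cat runs behind rat",
    "doc2": "The dog runs behind cat",
    "doc3": "The bull runs behind the player",
}
term_lists = {name: preprocess(text) for name, text in raw_docs.items()}
query_terms = preprocess("rat")

for name, terms in term_lists.items():
    print(name, terms)
print("query", query_terms)
```

A real system would normally use a much larger stopword list and might also apply stemming, which this example does not need.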

Words/documents    Document 1    Document 2    query
cat                1             1             0
runs               1             1             0
behind             1             1             0
rat                1             0             1
dog                0             1             0

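Here n = 2 (Documents 1 and 2) and the logarithm is base 10, so log(2/1) = 0.30103.
Words     Document frequency (df)    idf = log(n/df)
cat       2                          0
runs      2                          0
behind    2                          0
rat       1                          0.30103
dog       1                          0.30103

Multiplying each cell of the term-document matrix by the idf of its term gives the weights:
Words/documents    doc1       doc2       query
cat                0          0          0
runs               0          0          0
behind             0          0          0
rat                0.30103    0          0.30103
dog                0          0.30103    0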
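
The slides do not say which similarity score is used to compare each document with the query. A common choice in the vector space model is cosine similarity, so the sketch below assumes it. The weights are copied from the table above, and the function name cosine_similarity is illustrative.

```python
import math

# Term order: cat, runs, behind, rat, dog (weights taken from the table above).
weights = {
    "doc1": [0, 0, 0, 0.30103, 0],
    "doc2": [0, 0, 0, 0, 0.30103],
}
query = [0, 0, 0, 0.30103, 0]


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction, so treat it as no match
    return dot / (norm_a * norm_b)


scores = {name: cosine_similarity(vec, query) for name, vec in weights.items()}
best = max(scores, key=scores.get)
print(scores)
print("most relevant:", best)
```

Under this assumption Document 1 is ranked first, because it is the only document that shares the term "rat" with the query.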

Advantages of vector space model
• The vector space model has the following
advantages:
1. Allows ranking documents according to their
possible relevance.
2. Allows retrieving items with partial term
overlap.

Limitation
• The vector space model has the following
limitations:
1. Query terms are assumed to be independent,
so phrases might not be represented well in
the ranking.
2. Semantic sensitivity: documents with similar
context but different term vocabulary won't be associated.

Models based on the vector space model
• Models based on and extending the vector space
model include:
1. Generalized vector space model
2. Latent semantic analysis
3. Term
4. Rocchio Classification
5. Random indexing
6. Search Engine Optimization
