Vector Space Model
Gamela Nageh
Vector Space Model
• The vector space model can be used in search engines
and document retrieval systems.
• Given a set of documents and a search query, we
need to retrieve the relevant documents, that is, the
documents most similar to the search query.
[Figure: documents + search query -> relevant documents]
Steps of Vector Space Model
• The vector space model is an algebraic model
that involves two steps:
• In the first step, we represent each text document
as a vector of words.
• In the second step, we transform that vector into a
numerical format so that we can apply text mining
techniques such as information retrieval,
information extraction, and information filtering.
Example of vector space model
• Let us work through an example. Consider the
following documents:
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
Document vector representation
• The first step is to break each document into
words and apply preprocessing such as removing
stopwords, punctuation, and special characters.
document 1: (good, boy)
document 2: (good, girl)
document 3: (good, boy, girl)
• The next step is to convert these term vectors
into a numerical format known as the
term-document matrix.
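The preprocessing step can be sketched in Python as follows. This is only an illustration: the `tokenize` helper and the tiny `STOPWORDS` set are made up for this example, and real systems use much larger stopword lists.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "is", "of"}

def tokenize(text):
    """Lowercase the text, drop punctuation/special characters and stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

docs = ["good boy", "good girl", "boy girl good"]
tokens = [tokenize(d) for d in docs]
print(tokens)
```

The order of the words inside each document vector does not matter for the steps that follow; only which terms occur (and how often) is used.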
Term-Document Matrix
• A term-document matrix represents the document
vectors in matrix form: each row is a term vector
across all documents, and each column is a document
vector across all terms.
• Each cell holds the weight of a term in the
corresponding document. In the simplest (binary)
form, the cell contains 1 if the term is present in
the document and 0 if it is not; a frequency count
can be used instead.
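A minimal sketch of building the binary term-document matrix for the three example documents. The variable names are illustrative, and the alphabetical ordering of the terms is an arbitrary choice made here.

```python
docs = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["boy", "girl", "good"],
}

# Vocabulary: every distinct term, sorted for a stable row order.
vocab = sorted({term for tokens in docs.values() for term in tokens})

# One row per term, one column per document (doc1, doc2, doc3).
matrix = {
    term: [1 if term in tokens else 0 for tokens in docs.values()]
    for term in vocab
}

for term, row in matrix.items():
    print(f"{term:5s} {row}")
```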
TF*IDF
• A word that occurs in most of the documents
contributes little to deciding which documents
are relevant.
• Terms that occur in fewer documents are more
likely to define document relevance.
• This weighting can be achieved with a method known as
term frequency – inverse document frequency
(tf-idf).
TF*IDF
• First we calculate the term frequency (tf):
• tf = No. of occurrences of the term in a doc / No. of words in the doc
• Second we calculate the inverse document frequency (idf):
• idf = log(No. of documents / No. of documents containing
the term)
• tf-idf = tf * idf
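The formulas can be sketched in Python as below. The function names `tf`, `idf` and `tf_idf` are made up for this illustration, and the natural logarithm is an assumption, since the base of the log is not fixed here.

```python
import math

def tf(term, doc_tokens):
    """Occurrences of the term in the document / number of words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    """log(number of documents / number of documents containing the term)."""
    containing = sum(1 for tokens in all_docs if term in tokens)
    # Assumption: the term occurs in at least one document, so containing > 0.
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
print(tf("boy", docs[0]))      # 0.5
print(idf("good", docs))       # 0.0
print(tf_idf("boy", docs[0], docs))
```

A term that appears in every document, such as "good", gets idf = log(1) = 0 and therefore a tf-idf weight of 0 everywhere.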
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
• tf = No. of occurrences of the term in a doc / No. of words in the doc
Term    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
• idf = log(No. of documents / No. of documents containing
the term)
Term    idf
good    log(3/3)
boy     log(3/2)
girl    log(3/2)
• tf-idf = tf * idf
Term    tf (doc1)   tf (doc2)   tf (doc3)   idf
good    1/2         1/2         1/3         log(3/3) = 0
boy     1/2         0           1/3         log(3/2)
girl    0           1/2         1/3         log(3/2)

tf-idf  good    boy             girl
Doc1    0       1/2*log(3/2)    0
Doc2    0       0               1/2*log(3/2)
Doc3    0       1/3*log(3/2)    1/3*log(3/2)
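The whole worked example can be checked with a short script. This is a sketch under one assumption: the natural logarithm is used, so 1/2*log(3/2) comes out as roughly 0.2027 and 1/3*log(3/2) as roughly 0.1352. With a different log base the numbers change, but the zero entries do not.

```python
import math
from collections import Counter

docs = {
    "Doc1": ["good", "boy"],
    "Doc2": ["good", "girl"],
    "Doc3": ["boy", "girl", "good"],
}
terms = ["good", "boy", "girl"]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {t: sum(1 for tokens in docs.values() if t in tokens) for t in terms}
idf = {t: math.log(n_docs / df[t]) for t in terms}

# tf-idf weight of every term in every document.
weights = {}
for name, tokens in docs.items():
    counts = Counter(tokens)
    weights[name] = {t: counts[t] / len(tokens) * idf[t] for t in terms}

for name, row in weights.items():
    print(name, {t: round(w, 4) for t, w in row.items()})
```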
Example 2
• Document 1: A cat runs behind rat
• Document 2: The dog runs behind cat
• Document 3: The bull runs behind the player
Query: rat
After removing stopwords:
Doc 1: (cat, runs, behind, rat)
Doc 2: (dog, runs, behind, cat)
Doc 3: (bull, runs, behind, player)
Query: (rat)
• The document most relevant to the query is the one with the highest
similarity score: max(similarity(doc 1, query), similarity(doc 2, query)).
The tables that follow use Documents 1 and 2 only.
• The next step is to convert these term vectors into a numerical
format (term-document matrix).
Words/documents   Document 1   Document 2   query
cat               1            1            0
runs              1            1            0
behind            1            1            0
rat               1            0            1
dog               0            1            0
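The similarity score can be computed in several ways. The sketch below assumes cosine similarity, a common choice for the vector space model (the measure is not named here), and applies it to the binary vectors for Document 1, Document 2 and the query. The `cosine` helper is written for this example.

```python
import math

# Term order: cat, runs, behind, rat, dog
vectors = {
    "Document 1": [1, 1, 1, 1, 0],
    "Document 2": [1, 1, 1, 0, 1],
}
query = [0, 0, 0, 1, 0]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

scores = {name: cosine(vec, query) for name, vec in vectors.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Document 1 shares the term "rat" with the query and scores 0.5; Document 2 shares no term and scores 0, so Document 1 is returned as the most relevant.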
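idf = log(n/df), with n = 2 documents and log base 10
Words     Document frequency (df)   idf
cat       2                         0
runs      2                         0
behind    2                         0
rat       1                         0.30103
dog       1                         0.30103

tf-idf weights (each 0/1 entry of the term-document matrix multiplied by idf):
Words/documents   doc1      doc2      query
cat               0         0         0
runs              0         0         0
behind            0         0         0
rat               0.30103   0         0.30103
dog               0         0.30103   0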
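The numbers in these two tables can be reproduced as follows. The sketch assumes what the values imply: df and n are counted over doc1 and doc2 only (n = 2), the logarithm is base 10, and tf is the raw 0/1 entry. The final ranking by plain dot product is an extra assumption made for this illustration.

```python
import math

terms = ["cat", "runs", "behind", "rat", "dog"]
counts = {
    "doc1": [1, 1, 1, 1, 0],
    "doc2": [1, 1, 1, 0, 1],
    "query": [0, 0, 0, 1, 0],
}
doc_names = ["doc1", "doc2"]  # df and n are computed over the documents only
n = len(doc_names)

df = [sum(1 for d in doc_names if counts[d][i] > 0) for i in range(len(terms))]
idf = [math.log10(n / f) for f in df]

# tf-idf weight = raw count * idf, for the documents and the query alike.
tfidf = {name: [c * w for c, w in zip(vec, idf)] for name, vec in counts.items()}

for term, f, w in zip(terms, df, idf):
    print(f"{term:7s} df={f} idf={w:.5f}")

# Assumed scoring: dot product between each document and the query.
scores = {
    d: sum(a * b for a, b in zip(tfidf[d], tfidf["query"])) for d in doc_names
}
print(max(scores, key=scores.get))
```

Only "rat" and "dog" keep a non-zero weight, because "cat", "runs" and "behind" occur in both documents. The query matches doc1 through "rat" alone.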
Advantages of vector space model
• The vector space model has the following
advantages:
1. It allows documents to be ranked according to their
likely relevance.
2. It allows retrieval of items that only partially
overlap with the query terms.
Limitations
• The vector space model has the following
limitations:
1. Query terms are assumed to be independent,
so phrases might not be represented well in
the ranking.
2. Semantic sensitivity: documents with similar
meaning but different vocabulary won't be associated.
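Both limitations can be seen in a small bag-of-words sketch. The sentences below are invented for illustration, and `bag_of_words` is a helper written for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Term counts only; word order is discarded."""
    return Counter(text.lower().split())

# Limitation 1: terms are treated independently, so word order is lost.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
print(a == b)  # True

# Limitation 2: similar meaning, different vocabulary, no shared terms.
c = bag_of_words("fast car")
d = bag_of_words("quick automobile")
shared = set(c) & set(d)
print(len(shared))  # 0
```

With no shared terms, the dot product of the two vectors is zero, so any similarity score based on term overlap treats the last two texts as unrelated.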
Models based on the vector space model
• Models based on and extending the vector space
model include:
1. Generalized vector space model
2. Latent semantic analysis
3. Term
4. Rocchio Classification
5. Random indexing
6. Search Engine Optimization
