Vector Space Model
Gamela Nageh
Vector Space Model
• The vector space model can be used in search engines
and document retrieval systems.
• Given a set of documents and a search query (a set of search terms),
we need to retrieve the relevant documents, that is, those
similar to the search query.
[Figure: Documents and Search query as inputs, Relevant documents as output]
Steps of Vector Space Model
• The vector space model is an algebraic model
involving two steps:
• In the first step, we represent each text document
as a vector of words.
• In the second step, we transform these vectors into a numerical format
so that we can apply text mining
techniques such as information retrieval,
information extraction, and information filtering.
Example of vector space model
• Let us understand this with an example. Consider
the documents below:
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
Document vectors representation
• The first step is to break each
document into words and apply preprocessing
steps such as removing stopwords, punctuation, and
special characters.
document 1: (good, boy)
document 2: (good, girl)
document 3: (good, boy, girl)
• The next step is to represent the term vectors created
above in a numerical format known as a
term-document matrix.
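The preprocessing step above could be sketched in Python roughly as follows. The `tokenize` helper and the small `STOPWORDS` set are illustrative assumptions, not part of the original slides; a real system would use a much fuller stopword list and tokenizer.

```python
import re

# Assumed, illustrative stopword list; real systems use a much larger one.
STOPWORDS = {"a", "an", "the", "is", "of", "and"}

def tokenize(text):
    """Lowercase the text, keep alphabetic runs only, and drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

docs = ["good boy", "good girl", "boy girl good"]
for i, d in enumerate(docs, start=1):
    print(f"document {i}: {tokenize(d)}")
```

Punctuation and special characters are dropped because only alphabetic runs are kept; the order of words within a document is preserved here, although the model itself ignores it.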
Term Document Matrix
• A term-document matrix represents
document vectors in matrix form: each
row is a term vector across all the
documents, and each column is a document
vector across all the terms.
• Each cell holds the frequency count of the term in the
corresponding document. In the simplest (binary) case, the
cell contains 1 if the term is present in the
document and 0 if it is not.
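As a minimal sketch, the term-document matrix for the three example documents could be built like this. The variable names (`docs`, `vocab`, `matrix`) are assumptions chosen for illustration, and the documents are assumed to be already tokenized.

```python
docs = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["good", "boy", "girl"],
}

# Vocabulary in first-seen order so the rows are reproducible.
vocab = []
for tokens in docs.values():
    for t in tokens:
        if t not in vocab:
            vocab.append(t)

# Rows are terms, columns are documents; each cell is a frequency count.
matrix = {t: [tokens.count(t) for tokens in docs.values()] for t in vocab}

for term, row in matrix.items():
    print(term, row)
```

Because no word repeats within a document in this example, the frequency counts coincide with the binary 1/0 values.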
TF*IDF
• Note that a word that occurs in most of
the documents may contribute little to
document relevance.
• By contrast, terms that occur less frequently may
better define document relevance.
• This weighting can be achieved using a method known as
term frequency–inverse document frequency
(tf-idf).
TF*IDF
• First we calculate the term frequency (tf):
• tf = No. of occurrences of the word in a doc / No. of words in the doc
• Second we calculate the inverse document frequency (idf):
• idf = log(No. of documents / No. of documents containing
the word)
• tf-idf = tf * idf
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
• tf = No. of occurrences of the word in a doc / No. of words in the doc
word    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3
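The term frequencies for the three example documents can be reproduced with a short sketch. The helper name `term_frequency` is an assumption for illustration; exact fractions are used so the values match the 1/2 and 1/3 entries directly.

```python
from collections import Counter
from fractions import Fraction

def term_frequency(tokens):
    """tf = occurrences of the word in the doc / number of words in the doc."""
    counts = Counter(tokens)
    return {t: Fraction(c, len(tokens)) for t, c in counts.items()}

docs = {
    "doc1": ["good", "boy"],
    "doc2": ["good", "girl"],
    "doc3": ["boy", "girl", "good"],
}

tf = {name: term_frequency(tokens) for name, tokens in docs.items()}
for name, weights in tf.items():
    print(name, {t: str(v) for t, v in weights.items()})
```

A word that does not occur in a document is simply absent from that document's dictionary, which corresponds to a tf of 0.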
• Document 1: good boy
• Document 2: good girl
• Document 3: boy girl good
• idf = log(No. of documents / No. of documents containing
the word)
word    idf
good    log(3/3)
boy     log(3/2)
girl    log(3/2)
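A possible sketch of the idf calculation is shown below. The function name is hypothetical, and the base-10 logarithm is an assumption: the slides do not state the base for this example, although the value 0.30103 in Example 2 corresponds to log10(2).

```python
import math

def inverse_document_frequency(docs):
    """idf = log10(number of documents / number of documents containing the word)."""
    n = len(docs)
    vocab = {t for tokens in docs for t in tokens}
    return {
        t: math.log10(n / sum(t in tokens for tokens in docs))
        for t in vocab
    }

docs = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
idf = inverse_document_frequency(docs)
for term in sorted(idf):
    print(term, round(idf[term], 5))
```

Since "good" appears in all three documents, its idf is log(3/3) = 0, so it carries no weight in the ranking.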
• tf-idf = tf * idf

tf:
word    doc1    doc2    doc3
good    1/2     1/2     1/3
boy     1/2     0       1/3
girl    0       1/2     1/3

idf:
word    idf
good    log(3/3) = 0
boy     log(3/2)
girl    log(3/2)

tf-idf:
        good    boy             girl
Doc1    0       1/2*log(3/2)    0
Doc2    0       0               1/2*log(3/2)
Doc3    0       1/3*log(3/2)    1/3*log(3/2)
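Putting the two factors together, the tf-idf weights for this example could be computed as in the sketch below. The `tf_idf` function and its dictionary layout are illustrative assumptions, and the base-10 logarithm is again assumed rather than stated in the slides.

```python
import math
from collections import Counter

docs = {
    "Doc1": ["good", "boy"],
    "Doc2": ["good", "girl"],
    "Doc3": ["boy", "girl", "good"],
}

def tf_idf(docs):
    """Return {document: {term: tf * idf}} for every term in the collection."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for tokens in docs.values() for t in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            t: (counts[t] / len(tokens)) * math.log10(n / df[t]) for t in df
        }
    return weights

weights = tf_idf(docs)
for name, row in weights.items():
    print(name, {t: round(w, 4) for t, w in sorted(row.items())})
```

The term "good" receives weight 0 in every document, while "boy" and "girl" receive 1/2*log(3/2) or 1/3*log(3/2) depending on the document length.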
Example 2
• Document 1: A cat runs behind rat
• Document 2: The dog runs behind cat
• Document 3: The bull runs behind the player
query: rat
Doc 1: (cat, runs, behind, rat)
Doc 2: (dog, runs, behind, cat)
Doc 3: (bull, runs, behind, player)
Query: (rat)
• The document most relevant to the query = Max(similarity score between (doc 1,
Query), similarity score between (doc 2, Query))
• The next step is to represent the term vectors created above in a numerical
format (term-document matrix).
Words/documents    Document 1    Document 2    query
cat                1             1             0
runs               1             1             0
behind             1             1             0
rat                1             0             1
dog                0             1             0
Doc 1: (cat, runs, behind, rat)
Doc 2: (dog, runs, behind, cat)
Doc 3: (bull, runs, behind, player)
Query: (rat)
Words      Document frequency (df)    idf = log(n/df)
cat        2                          0
runs       2                          0
behind     2                          0
rat        1                          0.30103
dog        1                          0.30103

Words/documents    doc1       doc2       query
cat                0          0          0
runs               0          0          0
behind             0          0          0
rat                0.30103    0          0.30103
dog                0          0.30103    0
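The slides do not say which similarity score is used. A common choice in the vector space model is cosine similarity, so the sketch below assumes it; the `cosine` helper and the variable names are illustrative. The weights are copied from the table for doc1, doc2, and the query.

```python
import math

terms = ["cat", "runs", "behind", "rat", "dog"]
doc1 = [0, 0, 0, 0.30103, 0]
doc2 = [0, 0, 0, 0, 0.30103]
query = [0, 0, 0, 0.30103, 0]

def cosine(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

scores = {"doc1": cosine(doc1, query), "doc2": cosine(doc2, query)}
best = max(scores, key=scores.get)
print(scores, "most relevant:", best)
```

Under this assumption doc1 scores close to 1 because it shares the only weighted query term ("rat"), while doc2 scores 0, so doc1 is returned as the most relevant document.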
Advantages of vector space model
• The vector space model has the following
advantages:
1. Allows ranking documents according to their
possible relevance.
2. Allows retrieving items with partial term
overlap.
Limitations
• The vector space model has the following
limitations:
1. Query terms are assumed to be independent,
so phrases might not be represented well in
the ranking.
2. Semantic sensitivity: documents with similar
context but different term vocabulary won’t be associated.
Models based on the vector space model
• Models based on and extending the vector space
model include:
1. Generalized vector space model
2. Latent semantic analysis
3. Term
4. Rocchio Classification
5. Random indexing
6. Search Engine Optimization
