  1. Vector Space Model & Latent Semantic Indexing. Ryan Reck, November 18, 2008.
  2. Outline: 1 Introduction; 2 Vector Space Model; 3 Latent Semantic Indexing; 4 Applications of VSM & LSI; 5 Comparison: VSM vs. LSI; 6 Conclusion; 7 References.
  3. Introduction: What are VSM & LSI? VSM and LSI are techniques from information retrieval for managing documents based on their content.
  4. Vector Space Model. Models each document as a vector in a multi-dimensional space. Similar documents lie closer together, and the angle between two vectors can be interpreted as the similarity of the two documents. Queries are translated into the same vector space, and the nearest documents (by distance between points, or by vector angle) are returned as the desired documents. Originated from the SMART Information Retrieval project at Cornell University; the first paper was published in 1975 [2].
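  5. Vector Space Model: Example. Each document is a vector of term frequencies over the same n terms: $doc_1 = \langle tf_1, tf_2, tf_3, \ldots, tf_n \rangle$ and $doc_2 = \langle tf_1, tf_2, tf_3, \ldots, tf_n \rangle$. Their similarity is the cosine of the angle between them: $\mathrm{sim}(doc_1, doc_2) = \cos\theta = \frac{doc_1 \cdot doc_2}{\lVert doc_1 \rVert \, \lVert doc_2 \rVert}$, which reduces to the dot product $v_0 \cdot v_1$ when $v_0$ and $v_1$ are the two document vectors normalized to unit length.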
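
The cosine comparison and the query lookup described in the slides above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names, the whitespace tokenizer, and the three-word vocabulary are all assumptions made for the example.

```python
import math
from collections import Counter

def term_frequencies(text, vocabulary):
    """Return the term-frequency vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())  # naive whitespace tokenizer (assumption)
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)

def rank_documents(query, documents, vocabulary):
    """Score every document against the query, most similar first."""
    q = term_frequencies(query, vocabulary)
    scored = [
        (cosine_similarity(q, term_frequencies(doc, vocabulary)), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical toy data
vocabulary = ["computer", "laptop", "elevator"]
documents = ["computer laptop computer", "laptop computer", "elevator elevator"]
ranking = rank_documents("laptop", documents, vocabulary)
```

With this toy data the query "laptop" ranks "laptop computer" first, because its vector points in a direction closer to the query than the document that mentions "computer" twice. Raw counts are used here for simplicity; a real system would normally weight the terms first.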
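  6. Vector Space Model: Calculating Term Weights. VSM introduced the Term Frequency - Inverse Document Frequency (TF-IDF) method of calculating term weights. TF-IDF gives greater weight to rare terms and less weight to common ones, since rare terms distinguish documents better than common terms do. $w_{t,d} = tf_t \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$, where $tf_t$ is the frequency of term $t$ in document $d$, $|D|$ is the total number of documents, and the denominator is the number of documents that contain $t$.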
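
To make the weighting concrete, here is a small Python sketch of the TF-IDF formula above. It assumes raw counts for term frequency, the natural logarithm, and whitespace tokenization; the function name and the tiny corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {term: weight} dict per document, using tf * log(|D| / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append(
            {term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf}
        )
    return weights

# Hypothetical toy corpus: "the" occurs in every document
corpus = ["the computer", "the laptop", "the elevator the"]
weights = tf_idf_weights(corpus)
```

In this corpus "the" appears in all three documents, so its weight is zero everywhere no matter how often it occurs, while each of the distinguishing words gets a weight of log(3).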
  7. Latent Semantic Indexing. Built on the Vector Space Model. Extracts concepts from the term-document matrix by combining correlated dimensions into a single aggregate dimension. This allows documents to be indexed by concept instead of by individual terms.
  8. Latent Semantic Indexing: Example. Good example: {computer, laptop} -> {1.2 * computer + 0.9 * laptop}. Realistic example: {computer, elevator} -> {1.2 * computer + 0.9 * elevator}.
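
The slides do not say how the correlated dimensions are combined. In the usual formulation of LSI this is done with a truncated singular value decomposition of the term-document matrix, so the sketch below rests on that assumption. The symbols are introduced here for illustration: $A$ is the $m \times n$ term-document matrix ($m$ terms, $n$ documents) and $k$ is the chosen number of concepts.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{n \times k}

\hat{d} = \Sigma_k^{-1} U_k^{\top} d
```

Each column of $U_k$ is one concept, expressed as a weighted combination of terms, which is the shape of the "1.2 * computer + 0.9 * laptop" example. The second line maps a term vector $d$ (a new document or a query) into the $k$-dimensional concept space without recomputing the decomposition, so the concepts themselves do not change when a vector is mapped in this way.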
  9. Applications of VSM & LSI. VSM, or a variation of it, is almost universal. Search engines: Apache Lucene.
  10. Comparison: VSM vs. LSI. Advantages of LSI: handles synonymy and polysemy directly; can match documents that use differing vocabularies; can even match across different languages, once some translated documents have been processed [1]. Advantages of VSM: much simpler, but still performs well; handles new documents more easily, whereas LSI's dimension reduction can cause problems when documents are added.
  11. Conclusion. VSM and LSI are both good ways to index and compare documents. VSM is basic but still gets the job done. LSI is a more complex system, but it can do a very good job even under extreme circumstances, such as multi-language datasets.
  12. References.
      [1] Dumais, S. T., Letsche, T. A., Littman, M. L., and Landauer, T. K. Automatic cross-language retrieval using latent semantic indexing. In AAAI Symposium on CrossLanguage Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997.
      [2] Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
      [3] Latent semantic indexing, 2008. http://en.wikipedia.com/wiki/Latent semantic indexing.
      [4] Vector space model, 2008. http://en.wikipedia.com/wiki/Vector space model.
