# Vsm lsi

1. Vector Space Model & Latent Semantic Indexing. Ryan Reck, November 18, 2008
2. Outline: Introduction; Vector Space Model; Latent Semantic Indexing; Applications of VSM & LSI; Comparison: VSM vs. LSI; Conclusion; References
3. Introduction: What are VSM & LSI? VSM and LSI are techniques from information retrieval for managing documents based on their content.
4. Vector Space Model: Models each document as a vector in a multi-dimensional space. Similar documents lie closer together, and the angle between two vectors can be interpreted as the similarity of the two documents. A query is translated into the same vector space, and the nearest documents (by distance between points, or by vector angle) are returned as the desired documents. The model originated in the SMART Information Retrieval project at Cornell University; the first paper on it was published in 1975 [2].
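5. Vector Space Model, Example: $doc_1 = \langle tf_1, tf_2, tf_3, \ldots, tf_n \rangle$, $doc_2 = \langle tf_1, tf_2, tf_3, \ldots, tf_n \rangle$, $\mathrm{sim}(doc_1, doc_2) = \cos(\theta) = \frac{doc_1 \cdot doc_2}{\lVert doc_1 \rVert \, \lVert doc_2 \rVert}$, which reduces to the dot product $v_0 \cdot v_1$ when $v_0$ and $v_1$ are the two document vectors normalized to unit length.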
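6. Vector Space Model, Calculating Term Weights: VSM introduced the Term Frequency - Inverse Document Frequency (TF-IDF) method of calculating term weights. TF-IDF gives greater weight to less common terms and less weight to common ones, since rare terms distinguish documents better than common terms do. $W_{t,d} = tf_t \cdot \log\left(\frac{|D|}{|\{t \in d\}|}\right)$, where $tf_t$ is the frequency of term $t$ in document $d$, $|D|$ is the total number of documents, and $|\{t \in d\}|$ is the number of documents that contain $t$.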
7. Latent Semantic Indexing: Built on top of the Vector Space Model. Extracts concepts from the term-document matrix by combining correlated dimensions into a single aggregate dimension. This allows documents to be indexed by concept instead of by simple terms.
8. Latent Semantic Indexing, Example: Good example: {computer, laptop} → {1.2 · computer + 0.9 · laptop}. Realistic example: {computer, elevator} → {1.2 · computer + 0.9 · elevator}.
9. Applications of VSM & LSI: VSM, or a variation of it, is used almost universally. Examples: search engines; Apache Lucene.
10. Comparison: VSM vs. LSI. Advantages of LSI: handles synonymy and polysemy directly; can match documents that use differing vocabularies; can even match across different languages, once some translated documents have been processed [1]. Advantages of VSM: much simpler, but still performs well; handles new documents more easily, whereas LSI's dimension reduction can cause problems when documents are added.
11. Conclusion: VSM and LSI are both good ways to index and compare documents. VSM is fairly basic but still gets the job done. LSI is a more complex system, but it can do a very good job even under extreme circumstances, such as multi-language datasets.
12. References: [1] Dumais, S. T., Letsche, T. A., Littman, M. L., and Landauer, T. K. Automatic cross-language retrieval using latent semantic indexing. In AAAI Symposium on Cross-Language Text and Speech Retrieval. American Association for Artificial Intelligence, March 1997. [2] Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic indexing. Commun. ACM 18, 11 (1975), 613–620. [3] Latent semantic indexing, 2008. http://en.wikipedia.com/wiki/Latent_semantic_indexing. [4] Vector space model, 2008. http://en.wikipedia.com/wiki/Vector_space_model.
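
The TF-IDF weighting and cosine similarity described in the Vector Space Model slides can be sketched in a few lines of plain Python. This is a minimal illustration, not the SMART or Lucene implementation: the function names (`document_frequencies`, `tf_idf_vector`, `cosine_similarity`, `rank`), the whitespace tokenization, and the toy documents are all assumptions made for the example. Query terms that never occur in the collection are simply ignored.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it (|{t in d}|)."""
    return Counter(term for doc in docs for term in set(doc))


def tf_idf_vector(tokens, vocab, df, n_docs):
    """Weight each vocabulary term as tf_t * log(|D| / df_t)."""
    tf = Counter(tokens)
    return [tf[term] * math.log(n_docs / df[term]) for term in vocab]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (document index, similarity) pairs, most similar first."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    vocab = sorted(df)
    # The query is projected into the same vector space as the documents.
    query_vec = tf_idf_vector(query, vocab, df, n_docs)
    scores = [
        (i, cosine_similarity(query_vec, tf_idf_vector(doc, vocab, df, n_docs)))
        for i, doc in enumerate(docs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy collection (hypothetical data), tokenized by whitespace.
DOCS = [
    "the laptop computer is fast".split(),
    "the elevator is slow".split(),
    "the computer crashed".split(),
]
QUERY = "fast computer".split()

for index, score in rank(QUERY, DOCS):
    print(index, round(score, 3))
```

In this toy collection the word "the" appears in every document, so its weight is log(3/3) = 0 and it contributes nothing to any similarity score. The first document, which shares both query terms, ranks highest, and the elevator document, which shares none, scores zero.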