Latent Semantic Indexing For Information Retrieval


Introducing Latent Semantic Analysis through Singular Value Decomposition on Text Data for Information Retrieval



  1. Latent Semantic Indexing
     Sudarsun. S., M.Tech
     Checktronix India Pvt Ltd, Chennai 600034
     [email_address]
  2. What is NLP?
     - What is Natural Language?
     - Can a machine understand Natural Language?
     - How do we understand Natural Language?
     - How can we make a machine understand Natural Language?
     - What are the limitations?
  3. Major Entities
     - What is Syntactic Analysis?
       - Dealing with synonymy
       - Dealing with polysemy
     - What is Semantics?
       - Representing meanings as a Semantic Net
     - What is Knowledge?
       - How to represent knowledge?
     - What are Inferences and Reasoning?
       - How to use the accumulated knowledge?
  4. LSA for Information Retrieval
     - What is LSA?
     - Singular Value Decomposition
     - Method of LSA
     - Computation of similarity using cosine
     - Measuring similarities
     - Construction of pseudo-documents
     - Limitations of LSA
     - Alternatives to LSA
  5. What is LSA?
     - A statistical method that provides a way to describe the underlying structure of texts
     - Used in author recognition, search engines, plagiarism detection, and comparing texts for similarity
     - The contexts in which a certain word does or does not occur determine the similarity of the documents
     - Closely models human learning, especially the manner in which people learn a language and acquire a vocabulary
  6. Singular Value Decomposition
     - A multivariate data-reduction technique
     - Reduces a large dataset to a concentrated dataset containing only the significant information from the original data
  7. Mathematical Background of SVD
     - SVD decomposes a matrix into a product of 3 matrices
     - Let A be an m x n matrix; then the SVD of A is
           SVD(A) = U(m x k) * S(k x k) * V^t(k x n)
     - U and V are the left and right singular matrices respectively
     - U and V are orthogonal matrices whose column vectors are of unit length
     - S is a diagonal matrix whose diagonal elements are the singular values, arranged in descending order
     - k is the rank of A; k <= min(m, n)
  8. Computation of SVD
     To find the U, S and V matrices:
     - Find the eigenvalues and their corresponding eigenvectors of the matrix A*A^t
     - The singular values are the square roots of the eigenvalues
     - These singular values, arranged in descending order, form the diagonal elements of the diagonal matrix S
     - Divide each eigenvector by its length, normalising it to unit length
     - These eigenvectors form the columns of the matrix U
     - Similarly, the eigenvectors of the matrix A^t*A form the columns of the matrix V
     [Note: the non-zero eigenvalues of A*A^t and A^t*A are equal.]
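A quick Octave/MATLAB check of the relationship above, using a small made-up matrix (the values are purely illustrative):

    % Singular values of A versus square roots of the eigenvalues of A*A'
    A = [1 0 2;
         0 1 1;
         3 1 0;
         0 2 1];
    s      = svd(A);                         % singular values, descending
    lambda = sort(eig(A*A'), 'descend');     % eigenvalues of A*A', descending
    disp([s sqrt(lambda(1:length(s)))])      % the two columns agree (up to rounding)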
  9. Eigenvalues & Eigenvectors
     - A scalar lambda is called an eigenvalue of a matrix A if there is a non-zero vector v such that A*v = lambda*v. This non-zero vector v is an eigenvector of A.
     - The eigenvalues can be found by solving the characteristic equation | A - lambda*I | = 0.
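For instance, a minimal Octave sketch with an arbitrary symmetric 2x2 matrix:

    A = [2 1;
         1 2];
    [V, D] = eig(A);           % columns of V are eigenvectors, diag(D) the eigenvalues
    v = V(:, 1);
    lambda = D(1, 1);
    disp(A*v - lambda*v)       % approximately the zero vector, i.e. A*v = lambda*v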
 10. How to Build LSA?
     - Preprocess the document collection
       - Stemming
       - Stop-word removal
     - Build the frequency matrix
     - Apply pre-weights
     - Decompose the FM into U, S, V
     - Project queries
 11. Frequency Matrix
     Step #1: Construct the term-document matrix (TDM)
     - One column for each document
     - One row for every word
     - The value of cell (i, j) is the frequency of word i in document j
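A minimal Octave sketch of Step #1, using a toy corpus (the documents and vocabulary are invented for illustration and assumed to be already stemmed with stop words removed):

    docs  = { 'human machine interface', ...
              'user interface system', ...
              'human system system' };
    vocab = { 'human', 'machine', 'interface', 'user', 'system' };

    % Term-document matrix: rows = terms, columns = documents
    A = zeros(numel(vocab), numel(docs));
    for j = 1:numel(docs)
        words = strsplit(docs{j});
        for i = 1:numel(vocab)
            A(i, j) = sum(strcmp(words, vocab{i}));   % frequency of term i in document j
        end
    end
    disp(A)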
 12. Step #2: Weight Functions
     - Increase the effectiveness of the information retrieval
     - Allocate weights to the terms based on their occurrences
     - Each element is replaced with the product of a Local Weight Function (LWF) and a Global Weight Function (GWF)
       - The LWF considers the frequency of a word within a particular text
       - The GWF examines a term's frequency across all the documents
     - Pre-weighting: applied to the TDM before computing the SVD
     - Post-weighting: applied to the terms of a query when it is projected for matching or searching
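The slides do not prescribe particular weight functions; one common pairing is a log local weight with an IDF-style global weight. A sketch applied to the TDM A from the previous example:

    [nTerms, nDocs] = size(A);
    lwf = log(A + 1);                      % local weight: log(tf + 1)
    df  = sum(A > 0, 2);                   % document frequency of each term
    gwf = log(nDocs ./ max(df, 1));        % global weight: inverse document frequency
    W   = lwf .* repmat(gwf, 1, nDocs);    % pre-weighted term-document matrix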
 13. Step #3: SVD
     - Perform SVD on the term-document matrix A
     - SVD removes noise and infrequent words that do not help to classify a document
     - Octave/MATLAB can be used:
           [u, s, v] = svd(A);
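The dimensionality reduction comes from keeping only the k largest singular values and the corresponding columns of U and V. A sketch of the truncation (k = 2 is purely illustrative; real collections typically use many more dimensions):

    [U, S, V] = svd(W);        % W = pre-weighted TDM (or the raw TDM A)
    k  = 2;
    Uk = U(:, 1:k);
    Sk = S(1:k, 1:k);
    Vk = V(:, 1:k);
    Wk = Uk * Sk * Vk';        % rank-k approximation of the original matrix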
 14. [Diagram: A (m x n, terms x documents) = U (m x k) * S (k x k) * V^t (k x n), with S diagonal and zero elsewhere]
 15. [Diagram: the TDM (terms x documents) is decomposed by SVD into U, S and V]
 16. Similarity Computation Using Cosine
     - Consider two vectors A and B. The similarity between these two vectors is
           cos(theta) = (A . B) / (|A| * |B|)
     - cos(theta) ranges between -1 and +1
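As a one-line Octave helper, reused in the sketches below:

    cosine = @(a, b) dot(a, b) / (norm(a) * norm(b));
    disp(cosine([1 2 0], [2 1 1]))     % example value between -1 and +1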
 17. Similarity Computations in LSA
 18. Term-Term Similarity
     - Compute the cosine between the row vectors of term i and term j in the U*S matrix.
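Continuing the toy example (Uk and Sk from the truncated SVD, cosine from slide 16; the term indices are illustrative):

    T = Uk * Sk;                       % one row per term
    i = 1; j = 3;                      % e.g. 'human' vs 'interface' in the toy vocabulary
    disp(cosine(T(i, :), T(j, :)))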
 19. Document-Document Similarity
     - Compute the cosine between the column vectors of document i and document j in the S*V^t matrix.
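Again using the earlier sketch (Sk, Vk, cosine):

    D = Sk * Vk';                      % one column per document
    i = 1; j = 2;
    disp(cosine(D(:, i), D(:, j)))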
 20. Term-Document Similarity
     - Compute the cosine between the row vector of term i in the U*S^(1/2) matrix and the column vector of document j in the S^(1/2)*V^t matrix.
 21. [Diagram: the U*S^(1/2) matrix (rows = terms) alongside the S^(1/2)*V^t matrix (columns = documents)]
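A sketch of the same computation on the toy matrices (the indices are illustrative):

    Sh = sqrt(Sk);                     % square root of the diagonal matrix of singular values
    i = 5; j = 3;
    ti = Uk(i, :) * Sh;                % row of U*S^(1/2) for term i
    dj = Vk(j, :) * Sh;                % column of S^(1/2)*V^t for document j, written as a row
    disp(cosine(ti, dj))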
 22. Construction of a Pseudo-document
     - A query is broken into terms and represented as a column vector (say q) with M terms as rows.
     - The pseudo-document Q for the query q can then be constructed with the following formula:
           Q = q^t * U * S^(-1)
     - After constructing the pseudo-document, we can compute query-term and query-document similarities using the techniques mentioned earlier.
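A sketch of the projection on the toy example (the query terms are illustrative; vocab, Uk, Sk, Vk and cosine come from the earlier sketches):

    q = zeros(numel(vocab), 1);            % query as a column vector over the vocabulary
    q(strcmp(vocab, 'human'))  = 1;        % query: 'human system'
    q(strcmp(vocab, 'system')) = 1;
    Q = q' * Uk / Sk;                      % 1 x k pseudo-document, i.e. q'*Uk*inv(Sk)

    % Query-document similarity, treating Q as an extra row of V and scaling by Sk
    % as in the document-document case
    j = 3;
    disp(cosine(Q * Sk, Vk(j, :) * Sk))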
 23. Alternatives to LSA
     - LSA addresses the synonymy problem but not polysemy
     - PLSA: Probabilistic Latent Semantic Analysis, which handles polysemy
     - LDA: Latent Dirichlet Allocation
 24. References
     - http://www.cs.utk.edu/~lsi/papers/
     - http://www.cs.utk.edu/~berry/lsi++
     - http://people.csail.mit.edu/fergus/iccv2005/bagwords.html
     - http://research.nitle.org/lsi/lsa_explanation.htm
     - http://en.wikipedia.org/wiki/Latent_semantic_analysis
     - http://www-psych.nmsu.edu/~pfoltz/reprints/BRMIC96.html
     - http://www.pcug.org.au/~jdowling/
     - http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/PCA_1.htm
     - http://public.lanl.gov/mewall/kluwer2002.html
     - http://www.cs.utexas.edu/users/suvrit/work/progs/ssvd.html
 25. Thanks
     - You may send your queries to sudar@burning-glass.com
