This document provides an introduction to Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). It discusses some challenges with traditional information retrieval based on lexical matching. It then introduces LSA and PLSA as statistical approaches to address these challenges by discovering the latent semantic structure in word usage. The key aspects of LSA include using singular value decomposition to reduce the dimensionality of a term-document matrix. PLSA builds upon LSA by introducing a probabilistic model with a latent class variable to represent topics. The document contrasts the objective functions and interpretations of the reduced dimensional spaces between LSA and PLSA.
2. CITATION
LSA:
[1] M. W. Berry, S. T. Dumais, and T. A. Letsche, "Computational methods for intelligent information access," in Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, 1995, p. 20, IEEE. (cited by 179)
[2] S. T. Dumais, "Latent semantic analysis," Annual Review of Information Science and Technology, vol. 38, no. 1, pp. 188-230, 2004. (cited by 708)
PLSA:
[3] T. Hofmann, "Probabilistic latent semantic indexing," in ACM SIGIR Forum, 2017, vol. 51, no. 2, pp. 211-218, ACM. (cited by 5520)
[4] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1-2, pp. 177-196, 2001. (cited by 2615)
4. INTRODUCTION
Information retrieval
Typical: lexical match between words in users' requests and those in, or assigned to, documents in a database.
Problem: fundamental characteristics of human word usage underlie these retrieval failures; people generate the same keyword to describe well-known objects only 20 percent of the time.
→ People use a wide variety of words to describe the same object or concept (synonymy).
Ex. "human-computer interaction" vs. "man-machine study"
5. INTRODUCTION
Solutions
Stemming: converting words to their morphological root.
Ex. "retrieving", "retrieval" → retrieve (see the sketch after this list). But many related words are not morphologically related at all.
Controlled vocabulary: requiring that query and index terms belong to a pre-defined set of terms. But building and maintaining such a vocabulary is a time-consuming manual process.
LSA:
1. Fully unsupervised learning; an automatic statistical approach.
2. Recovers the latent structure in word usage that is obscured by variability in word choice.
→ SVD: the subspace represents important associative relationships between terms and documents that are not evident in individual documents.
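
A small sketch of stemming, assuming NLTK's PorterStemmer (the slides do not name a particular stemmer; this is just one common choice):

# Stemming with NLTK's Porter stemmer (assumed library choice).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("retrieving"))  # -> "retriev"
print(stemmer.stem("retrieval"))   # -> "retriev": both collapse to one root

# The limitation noted above: synonyms such as "human" and "person" share
# no morphological root, so stemming cannot conflate them.
print(stemmer.stem("human"), stemmer.stem("person"))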
6. SVD
Given any m × n matrix A with rank r, it can be factorized as
$A = U \Sigma V^T = \sum_{i=1}^{r} u_i \sigma_i v_i^T$
$U$: diagonalizing matrix for $AA^T$, containing orthogonal eigenvectors of $AA^T$
$\Sigma$: diagonal matrix of the positive singular values of $A$, the square roots of the nonzero eigenvalues of $AA^T$ and $A^T A$
$V$: diagonalizing matrix for $A^T A$, containing orthogonal eigenvectors of $A^T A$
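
A minimal NumPy sketch of this factorization and of rank-k truncation (the matrix values are invented for illustration):

# SVD of a small matrix and its best rank-k approximation.
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])  # an m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U are eigenvectors of A A^T, rows of Vt are eigenvectors of
# A^T A, and s holds the square roots of their shared eigenvalues.
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Keeping only the k largest singular values gives the best rank-k
# approximation of A in the Frobenius norm.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]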
8. LSA (LATENT SEMANTIC ANALYSIS)
1. Term-document matrix: rows are individual words and columns are documents.
2. Transformed matrix. Ex. TF-IDF = term frequency × inverse document frequency.
3. Dimension reduction: SVD comes into play!
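
A sketch of these three steps with scikit-learn; the toy corpus, the value of k, and the library choice are assumptions for illustration, not from the slides:

# LSA pipeline: TF-IDF weighting followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "human computer interaction",
    "man machine interface study",
    "graph theory and trees",
    "spanning trees of a graph",
]

# Steps 1-2: term counts, transformed by TF-IDF (documents are rows here,
# the transpose of the slide's term-document layout).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Step 3: reduce to k latent dimensions with truncated SVD.
k = 2
svd = TruncatedSVD(n_components=k)
doc_vectors = svd.fit_transform(X)  # each document as a k-dimensional vector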
9. LSA (LATENT SEMANTIC ANALYSIS)
Since the number of dimensions, k, is smaller than the number of unique terms, m, minor differences are ignored. Terms which occur in similar documents will be near each other in the k-dimensional factor space, and documents which share no words with a user's query may nonetheless be near it in k-space.
LSA makes no use of linguistic techniques for analyzing morphological, syntactic, or semantic relations, nor of humanly constructed resources like dictionaries. Its only input is large amounts of text.
Document-document, term-term, and term-document
similarities are computed in the reduced dimensional
approximation to A.
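
Continuing the sketch above, one plausible way to compute these similarities (folding the query into k-space via the fitted transforms is an assumption, not a method the slides spell out):

# Similarities in the reduced k-dimensional space.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Document-document similarities.
doc_sims = cosine_similarity(doc_vectors)

# Fold a query into the same space; it can be near documents that share
# none of its words.
query_vec = svd.transform(tfidf.transform(["human machine interface"]))
print(np.argsort(-cosine_similarity(query_vec, doc_vectors)[0]))  # ranked docs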
11. PLSA (PROBABILISTIC LSA)
An aspect model: a latent variable model which associates an unobserved class variable $z_k \in Z = \{z_1, \dots, z_K\}$ ($K \ll M, N$) with each observation, i.e. the occurrence of a word $w \in W = \{w_1, \dots, w_M\}$ in a particular document $d \in D = \{d_1, \dots, d_N\}$.
1. select a document $d$ with probability $P(d)$,
2. pick a latent class $z$ with probability $P(z|d)$,
3. generate a word $w$ with probability $P(w|z)$.
→ One obtains an observed pair $(d, w)$, while the latent class variable $z$ is discarded.
→ Translating this process into a joint probability model results in the expression
$P(d, w) = P(d)\,P(w|d) = \sum_{z \in Z} P(z)\,P(w|z)\,P(d|z)$
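
A minimal sketch of this generative process in NumPy; the parameter tables for P(d), P(z|d), and P(w|z) are random placeholders, not fitted values:

# Sampling (d, w) pairs from the PLSA aspect model.
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 4, 2, 6                                  # documents, classes, words

P_d = np.full(N, 1.0 / N)                          # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), size=N)    # P(z|d), one row per d
P_w_given_z = rng.dirichlet(np.ones(M), size=K)    # P(w|z), one row per z

def sample_pair():
    d = rng.choice(N, p=P_d)             # 1. select a document d
    z = rng.choice(K, p=P_z_given_d[d])  # 2. pick a latent class z
    w = rng.choice(M, p=P_w_given_z[z])  # 3. generate a word w
    return d, w                          # z itself is discarded

print([sample_pair() for _ in range(5)])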
12. PLSA (PROBABILISTIC LSA)
1. One determines $P(z)$, $P(w|z)$, $P(d|z)$ by maximization of the log-likelihood function
$L = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)$ (multinomial distribution)
2. The standard procedure for maximum likelihood estimation in latent variable models: the Expectation Maximization (EM) algorithm.
(i) E-step:
$P(z|d,w) = \dfrac{P(z)\,P(w|z)\,P(d|z)}{\sum_{z'} P(z')\,P(w|z')\,P(d|z')}$
(ii) M-step:
$P(w|z) = \dfrac{\sum_{d} n(d,w)\,P(z|d,w)}{\sum_{d,w'} n(d,w')\,P(z|d,w')}$,  $P(d|z) = \dfrac{\sum_{w} n(d,w)\,P(z|d,w)}{\sum_{d',w} n(d',w)\,P(z|d',w)}$,
$P(z) = \dfrac{1}{\sum_{d,w} n(d,w)} \sum_{d,w} n(d,w)\,P(z|d,w)$
Alternating (i) with (ii) defines a convergent procedure that approaches a local maximum of the log-likelihood $L$.
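
A compact NumPy sketch of this EM loop for the symmetric parameterization above; the count matrix n(d, w) is random toy data and the fixed iteration count is an arbitrary stopping rule:

# EM for PLSA on a toy document-word count matrix.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 8, 12, 3                                   # documents, words, classes
n = rng.integers(0, 5, size=(N, M)).astype(float)    # n(d, w)

# Random normalized initialization of P(z), P(d|z), P(w|z).
P_z = np.full(K, 1.0 / K)
P_d_z = rng.dirichlet(np.ones(N), size=K)            # shape (K, N)
P_w_z = rng.dirichlet(np.ones(M), size=K)            # shape (K, M)

for _ in range(50):
    # E-step: P(z|d,w) proportional to P(z) P(w|z) P(d|z), normalized over z.
    joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]  # (K, N, M)
    post = joint / joint.sum(axis=0, keepdims=True)

    # M-step: re-estimate all three parameter tables from expected counts.
    weighted = n[None, :, :] * post                  # n(d,w) P(z|d,w)
    P_w_z = weighted.sum(axis=1)                     # sum over d
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_d_z = weighted.sum(axis=2)                     # sum over w
    P_d_z /= P_d_z.sum(axis=1, keepdims=True)
    P_z = weighted.sum(axis=(1, 2)) / n.sum()

# The log-likelihood L = sum_{d,w} n(d,w) log P(d,w) increases over iterations.
P_dw = (P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]).sum(axis=0)
print((n * np.log(P_dw + 1e-12)).sum())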
14. LSA VS PLSA
1. The objective function utilized to determine the optimal approximation.
LSA: L2 (Frobenius) norm, corresponding to an implicit additive Gaussian noise assumption on (possibly transformed) counts.
PLSA: likelihood function of multinomial sampling, which aims at an explicit maximization of the predictive power of the model.
2. Interpretation of the directions.
LSA: no obvious interpretation.
PLSA: class-conditional word distributions $P(w|z)$ that define a certain topical context.
3. LSA and PLSA can be applied to a wide range of tasks other than information retrieval, e.g. document clustering, literature-based discovery, and modeling human memory.