This document provides an introduction to Latent Semantic Analysis (LSA) and Probabilistic Latent Semantic Analysis (PLSA). It discusses some challenges with traditional information retrieval based on lexical matching. It then introduces LSA and PLSA as statistical approaches to address these challenges by discovering the latent semantic structure in word usage. The key aspects of LSA include using singular value decomposition to reduce the dimensionality of a term-document matrix. PLSA builds upon LSA by introducing a probabilistic model with a latent class variable to represent topics. The document contrasts the objective functions and interpretations of the reduced dimensional spaces between LSA and PLSA.
2. CITATION
LSA:
[1] M. W. Berry, S. T. Dumais, and T. A. Letsche, "Computational methods for intelligent information access," in Supercomputing '95: Proceedings of the 1995 ACM/IEEE Conference on Supercomputing, 1995, p. 20, IEEE. (cited by 179)
[2] S. T. Dumais, "Latent semantic analysis," Annual Review of Information Science and Technology, vol. 38, no. 1, pp. 188-230, 2004. (cited by 708)
PLSA:
[3] T. Hofmann, "Probabilistic latent semantic indexing," in ACM SIGIR Forum, 2017, vol. 51, no. 2, pp. 211-218, ACM. (cited by 5520)
[4] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine Learning, vol. 42, no. 1-2, pp. 177-196, 2001. (cited by 2615)
4. INTRODUCTION
Information retrieval
Typical: lexical match between words in users' requests and those in, or assigned to, documents in a database.
Problem: fundamental characteristics of human word usage underlie these retrieval failures; people generate the same keyword to describe well-known objects only 20 percent of the time.
→ People use a wide variety of words to describe the same object or concept (synonymy).
Ex. "human-computer interaction" vs. "man-machine study"
5. INTRODUCTION
Solutions
Stemming: converting words to their morphological root.
Ex. "retrieving", "retrieval" → retrieve (see the sketch after this list). But many related words are not morphologically related at all.
Controlled vocabulary: requiring that query and index terms belong to a pre-defined set of terms. But building and maintaining such a vocabulary is a time-consuming manual process.
LSA:
1. Fully unsupervised learning; an automatic statistical approach.
2. Recovers the latent structure in word usage that is obscured by variability in word choice.
→ SVD: the subspace represents important associative relationships between terms and documents that are not evident in individual documents.
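
A small sketch of stemming, assuming NLTK's PorterStemmer (the slides do not name a particular stemmer; this is just one common choice):

# Stemming with NLTK's Porter stemmer (assumed library choice).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("retrieving"))  # -> "retriev"
print(stemmer.stem("retrieval"))   # -> "retriev": both collapse to one root

# The limitation noted above: synonyms such as "human" and "person" share
# no morphological root, so stemming cannot conflate them.
print(stemmer.stem("human"), stemmer.stem("person"))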
6. SVD
Given any m × n matrix A with rank r, it can be factorized as
$A = U \Sigma V^T = \sum_{i=1}^{r} u_i \sigma_i v_i^T$
$U$: diagonalizing matrix for $AA^T$, containing orthogonal eigenvectors of $AA^T$
$\Sigma$: diagonal matrix of the positive singular values of $A$, the square roots of the nonzero eigenvalues of $AA^T$ and $A^T A$
$V$: diagonalizing matrix for $A^T A$, containing orthogonal eigenvectors of $A^T A$
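
A minimal NumPy sketch of this factorization and of rank-k truncation (the matrix values are invented for illustration):

# SVD of a small matrix and its best rank-k approximation.
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0]])  # an m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U are eigenvectors of A A^T, rows of Vt are eigenvectors of
# A^T A, and s holds the square roots of their shared eigenvalues.
assert np.allclose(A, U @ np.diag(s) @ Vt)

# Keeping only the k largest singular values gives the best rank-k
# approximation of A in the Frobenius norm.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]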
8. LSA (LATENT SEMANTIC ANALYSIS)
1. Term-document matrix: rows are individual words and columns are documents.
2. Transformed matrix. Ex. TF-IDF = term frequency × inverse document frequency.
3. Dimension reduction: SVD comes into play!
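
A sketch of these three steps with scikit-learn; the toy corpus, the value of k, and the library choice are assumptions for illustration, not from the slides:

# LSA pipeline: TF-IDF weighting followed by truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "human computer interaction",
    "man machine interface study",
    "graph theory and trees",
    "spanning trees of a graph",
]

# Steps 1-2: term counts, transformed by TF-IDF (documents are rows here,
# the transpose of the slide's term-document layout).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# Step 3: reduce to k latent dimensions with truncated SVD.
k = 2
svd = TruncatedSVD(n_components=k)
doc_vectors = svd.fit_transform(X)  # each document as a k-dimensional vector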
9. LSA (LATENT SEMANTIC ANALYSIS)
Since the number of dimensions, k, is smaller than the number of unique terms, m, minor differences are ignored. Terms which occur in similar documents will be near each other in the k-dimensional factor space, and documents which share no words with a user's query may nonetheless be near it in k-space.
LSA makes no use of linguistic techniques for analyzing morphological, syntactic, or semantic relations, nor of humanly constructed resources like dictionaries. Its only input is large amounts of text.
Document-document, term-term, and term-document
similarities are computed in the reduced dimensional
approximation to A.
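
Continuing the sketch above, one plausible way to compute these similarities (folding the query into k-space via the fitted transforms is an assumption, not a method the slides spell out):

# Similarities in the reduced k-dimensional space.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Document-document similarities.
doc_sims = cosine_similarity(doc_vectors)

# Fold a query into the same space; it can be near documents that share
# none of its words.
query_vec = svd.transform(tfidf.transform(["human machine interface"]))
print(np.argsort(-cosine_similarity(query_vec, doc_vectors)[0]))  # ranked docs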
11. PLSA (PROBABILISTIC LSA)
An aspect model: a latent variable model which associates an unobserved class variable $z_k \in Z = \{z_1, \dots, z_K\}$ ($K \ll M, N$) with each observation, i.e. the occurrence of a word $w \in W = \{w_1, \dots, w_M\}$ in a particular document $d \in D = \{d_1, \dots, d_N\}$.
1. select a document $d$ with probability $P(d)$,
2. pick a latent class $z$ with probability $P(z|d)$,
3. generate a word $w$ with probability $P(w|z)$.
→ One obtains an observed pair $(d, w)$, while the latent class variable $z$ is discarded.
→ Translating this process into a joint probability model results in the expression
$P(d, w) = P(d)\,P(w|d) = \sum_{z \in Z} P(z)\,P(w|z)\,P(d|z)$
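
A minimal sketch of this generative process in NumPy; the parameter tables for P(d), P(z|d), and P(w|z) are random placeholders, not fitted values:

# Sampling (d, w) pairs from the PLSA aspect model.
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 4, 2, 6                                  # documents, classes, words

P_d = np.full(N, 1.0 / N)                          # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), size=N)    # P(z|d), one row per d
P_w_given_z = rng.dirichlet(np.ones(M), size=K)    # P(w|z), one row per z

def sample_pair():
    d = rng.choice(N, p=P_d)             # 1. select a document d
    z = rng.choice(K, p=P_z_given_d[d])  # 2. pick a latent class z
    w = rng.choice(M, p=P_w_given_z[z])  # 3. generate a word w
    return d, w                          # z itself is discarded

print([sample_pair() for _ in range(5)])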
12. PLSA (PROBABILISTIC LSA)
1. One determines $P(z)$, $P(w|z)$, $P(d|z)$ by maximization of the log-likelihood function
$L = \sum_{d \in D} \sum_{w \in W} n(d, w) \log P(d, w)$ (multinomial distribution)
2. The standard procedure for maximum likelihood estimation in latent variable models: the Expectation Maximization (EM) algorithm.
(i) E-step:
$P(z|d,w) = \dfrac{P(z)\,P(w|z)\,P(d|z)}{\sum_{z'} P(z')\,P(w|z')\,P(d|z')}$
(ii) M-step:
$P(w|z) = \dfrac{\sum_{d} n(d,w)\,P(z|d,w)}{\sum_{d,w'} n(d,w')\,P(z|d,w')}$,  $P(d|z) = \dfrac{\sum_{w} n(d,w)\,P(z|d,w)}{\sum_{d',w} n(d',w)\,P(z|d',w)}$,
$P(z) = \dfrac{1}{\sum_{d,w} n(d,w)} \sum_{d,w} n(d,w)\,P(z|d,w)$
Alternating (i) with (ii) defines a convergent procedure that approaches a local maximum of the log-likelihood $L$.
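
A compact NumPy sketch of this EM loop for the symmetric parameterization above; the count matrix n(d, w) is random toy data and the fixed iteration count is an arbitrary stopping rule:

# EM for PLSA on a toy document-word count matrix.
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 8, 12, 3                                   # documents, words, classes
n = rng.integers(0, 5, size=(N, M)).astype(float)    # n(d, w)

# Random normalized initialization of P(z), P(d|z), P(w|z).
P_z = np.full(K, 1.0 / K)
P_d_z = rng.dirichlet(np.ones(N), size=K)            # shape (K, N)
P_w_z = rng.dirichlet(np.ones(M), size=K)            # shape (K, M)

for _ in range(50):
    # E-step: P(z|d,w) proportional to P(z) P(w|z) P(d|z), normalized over z.
    joint = P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]  # (K, N, M)
    post = joint / joint.sum(axis=0, keepdims=True)

    # M-step: re-estimate all three parameter tables from expected counts.
    weighted = n[None, :, :] * post                  # n(d,w) P(z|d,w)
    P_w_z = weighted.sum(axis=1)                     # sum over d
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_d_z = weighted.sum(axis=2)                     # sum over w
    P_d_z /= P_d_z.sum(axis=1, keepdims=True)
    P_z = weighted.sum(axis=(1, 2)) / n.sum()

# The log-likelihood L = sum_{d,w} n(d,w) log P(d,w) increases over iterations.
P_dw = (P_z[:, None, None] * P_d_z[:, :, None] * P_w_z[:, None, :]).sum(axis=0)
print((n * np.log(P_dw + 1e-12)).sum())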
14. LSA VS PLSA
1. The objective function utilized to determine the optimal approximation.
LSA: L2 (Frobenius) norm, corresponding to an implicit additive Gaussian noise assumption on (possibly transformed) counts.
PLSA: likelihood function of multinomial sampling, which aims at an explicit maximization of the predictive power of the model.
2. Interpretation of the directions.
LSA: no obvious interpretation.
PLSA: class-conditional word distributions $P(w|z)$ that define a certain topical context.
3. LSA and PLSA can be applied to a wide range of tasks other than information retrieval, e.g. document clustering, literature-based discovery, and modeling human memory.