Introduction to Probabilistic Latent Semantic Analysis

NYP Predictive Analytics Meetup

June 10, 2010

PLSA

•  A type of latent variable model with observed count data and nominal latent variable(s).

•  Despite the adjective ‘semantic’ in the acronym, the method is not inherently about meaning.

   –  Not any more than, say, its cousin Latent Class Analysis.

•  Rather, the name must be read as P + LS(A|I), marking the genealogy of PLSA as a probabilistic re-cast of Latent Semantic Analysis/Indexing.

LSA

•  Factorization of the data matrix into orthogonal matrices to form bases of the (semantic) vector space:

      X = U Σ V^T,   with U and V orthogonal

•  Reduction of the original matrix to lower rank, keeping the k largest singular values:

      X_k = U_k Σ_k V_k^T

•  LSA for text complexity: cosine similarity between paragraphs (see the sketch below).
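A minimal sketch of the factorization and rank reduction above, assuming NumPy; the toy count matrix X, the rank k, and all variable names are illustrative only:

    import numpy as np

    # Toy document-term count matrix (rows = documents, columns = terms)
    X = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [0, 1, 2, 2]], dtype=float)

    k = 2                                          # number of latent dimensions
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction

    # Documents in the reduced (semantic) space, compared by cosine similarity
    docs_k = U[:, :k] * s[:k]
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(docs_k[0], docs_k[1]))            # similarity of documents 0 and 1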

Problems with LSA

•  Non-probabilistic.

•  Fails to handle polysemy.

   –  Polysemy is called “noise” in the LSA literature.

•  Shown (by Hofmann) to underperform compared to PLSA on an IR task.

Probabilities: Why?

•  Probabilistic systems allow for the evaluation of propositions under conditions of uncertainty: probabilistic semantics.

•  Probabilistic systems provide a uniform mechanism for integrating and reasoning over heterogeneous information.

   –  In PLSA semantic dimensions are represented by unigram language models, more transparent than eigenvectors.

   –  The latent variable structure allows for subtopics (hierarchical PLSA).

•  “If the weather is sunny tomorrow and I’m not tired we will go to the beach”

   –  p(beach) = p(sunny & ~tired) = p(sunny)(1 - p(tired))
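For instance, with hypothetical values p(sunny) = 0.7 and p(tired) = 0.2, the rule gives:

      p(beach) = 0.7 × (1 - 0.2) = 0.56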

A Generative Model?

•  Let X be a random vector with components {X1, X2, … , Xn}, each a random variable.

•  Each realization of X is assigned to a class, a value of a random variable Y.

•  A generative model tells a story about how the Xs came about: “once upon a time, a Y was selected, then Xs were created out of that Y”.

•  A discriminative model strives to identify, as unambiguously as possible, the Y value for some given X.

A Generative Model?

•  A discriminative model estimates P(Y|X) directly.

•  A generative model estimates P(X|Y) and P(Y).

   –  The predictive direction is then computed via Bayesian inversion:

         P(Y|X) = P(X|Y) P(Y) / P(X)

      where P(X) is obtained by conditioning on Y:

         P(X) = Σ_y P(X|Y=y) P(Y=y)
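A minimal numeric sketch of the inversion, with made-up tables for a binary Y and a binary X (NumPy assumed):

    import numpy as np

    p_y = np.array([0.6, 0.4])              # P(Y)
    p_x_given_y = np.array([[0.9, 0.1],     # P(X|Y=0)
                            [0.3, 0.7]])    # P(X|Y=1)

    # P(X) by conditioning on Y, then P(Y|X) by Bayes
    p_x = p_y @ p_x_given_y                            # shape (2,)
    p_y_given_x = (p_x_given_y * p_y[:, None]) / p_x   # rows: Y, columns: X

    print(p_y_given_x[:, 1])                # P(Y | X=1)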


      
   


A Generative Model?

•  A classic generative/discriminative pair: Naïve Bayes vs Logistic Regression.

•  Naïve Bayes assumes that the Xi are conditionally independent given Y, so it estimates P(Xi | Y).

•  Logistic regression makes other assumptions, e.g. linearity of the independent variables with the logit of the dependent, and independence of errors, but it handles correlated predictors (up to perfect collinearity).
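A minimal sketch contrasting the pair on synthetic data, assuming scikit-learn is available; the dataset and settings are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    gen = GaussianNB().fit(X, y)                         # generative: models P(X|Y) and P(Y)
    disc = LogisticRegression(max_iter=1000).fit(X, y)   # discriminative: models P(Y|X)

    print(gen.score(X, y), disc.score(X, y))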

A Generative Model?

•  Generative models have richer probabilistic semantics.

   –  Functions run both ways.

   –  Assign distributions to the “independent” variables, even previously unseen realizations.

•  Ng and Jordan (2002) show that logistic regression has higher asymptotic accuracy, but converges more slowly, suggesting a trade-off between accuracy and variance.

•  Overall, a trade-off between accuracy and usefulness.

A Generative Model?

•  Start with document: D → Z → W, with P(D), P(Z|D), P(W|Z).

•  Start with topic: Z → D and Z → W, with P(Z), P(D|Z), P(W|Z).

A Generative Model?

•  The observed data are the cells of the document-term matrix.

   –  We generate (doc, word) pairs.

   –  Random variables D, W and Z as sources of objects.

•  Either (see the sampling sketch below):

   –  Draw a document, draw a topic from the document, draw a word from the topic.

   –  Draw a topic, draw a document from the topic, draw a word from the topic.

•  The two models are statistically equivalent.

   –  Will generate identical likelihoods when fit.

   –  Proof by Bayesian inversion.

•  In any case D and W are conditionally independent given Z.
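A minimal sampling sketch of the second (symmetric) story, with made-up toy distributions; integer indices stand in for documents, topics and words, and all names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    p_z = np.array([0.5, 0.5])                      # P(Z)
    p_d_given_z = np.array([[0.7, 0.2, 0.1],        # P(D|Z=0)
                            [0.1, 0.3, 0.6]])       # P(D|Z=1)
    p_w_given_z = np.array([[0.6, 0.3, 0.1, 0.0],   # P(W|Z=0)
                            [0.0, 0.1, 0.4, 0.5]])  # P(W|Z=1)

    def sample_pair():
        # Draw a topic, then a document and a word from that topic
        z = rng.choice(2, p=p_z)
        d = rng.choice(3, p=p_d_given_z[z])
        w = rng.choice(4, p=p_w_given_z[z])
        return d, w

    pairs = [sample_pair() for _ in range(10000)]

The asymmetric story (document first, then a topic via P(Z|D), then a word) induces the same joint P(D,W), which is what the equivalence claim above amounts to.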

A Generative Model?

•  But what is a Document here?

   –  Just a label! There are no attributes associated with documents.

   –  P(D|Z) relates topics to labels.

•  A previously unseen document is just a new label.

•  Therefore PLSA isn’t generative in an interesting way, as it cannot handle previously unseen inputs in a generative manner.

   –  Though the P(Z) distribution may still be of interest.

Estimating the Parameters

•  θ = {P(Z); P(D|Z); P(W|Z)}

•  All distributions refer to the latent variable Z, so they cannot be estimated directly from the data.

•  How do we know when we have the right parameters?

   –  When we have the θ that most closely generates the data, i.e. the document-term matrix.


Estimating the Parameters

      P(D,W) = Σ_z P(z) P(D|z) P(W|z)

•  The joint P(D,W) generates the observed document-term matrix.

•  The parameter vector θ yields the joint P(D,W).

•  We want the θ that maximizes the probability of the observed data.

Estimating the Parameters

•  For the multinomial distribution, the log likelihood of the observed counts is

      L(θ) = Σ_{d,w} n(d,w) log P(d,w)

•  Let X be the M×N document-term matrix, with cells n(d,w).
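A minimal sketch of that log likelihood for a given parameter set, assuming NumPy; the array names (p_z, p_d_given_z, p_w_given_z, with shapes K, K×M, K×N) are conventions of this sketch, not part of the slides:

    import numpy as np

    def log_likelihood(X, p_z, p_d_given_z, p_w_given_z):
        # Joint P(D,W) = sum_z P(z) P(d|z) P(w|z), shape (M, N)
        p_dw = np.einsum('k,kd,kw->dw', p_z, p_d_given_z, p_w_given_z)
        # Multinomial log likelihood: sum over cells of n(d,w) log P(d,w)
        return float(np.sum(X * np.log(p_dw + 1e-12)))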


Estimating the Parameters

•  Imagine we knew the X′ = M×N×K complete data matrix, where the counts n(d,w,z) for topics were overt. Then,

      L_c(θ) = Σ_{d,w,z} n(d,w,z) log[ P(z) P(d|z) P(w|z) ]

   –  New and interesting: the unseen counts n(d,w,z); for a given d,w their shares of n(d,w) must sum to 1.

   –  The usual parameters θ: P(z), P(d|z), P(w|z).

Estimating the Parameters

•  We can factorize the counts in terms of the observed counts and a hidden distribution:

      n(d,w,z) = n(d,w) P(z|d,w)

•  Let’s give the hidden distribution its name: P(Z|D,W), the posterior distribution of Z w.r.t. D,W.

Estimating the Parameters

•  P(Z|D,W) can be obtained from the parameters via Bayes and our core model assumption of conditional independence:

      P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')
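A minimal sketch of that posterior computation, using the same hypothetical array conventions as the log-likelihood sketch above:

    import numpy as np

    def posterior(p_z, p_d_given_z, p_w_given_z):
        # Unnormalized P(z) P(d|z) P(w|z), shape (K, M, N)
        joint = p_z[:, None, None] * p_d_given_z[:, :, None] * p_w_given_z[:, None, :]
        return joint / joint.sum(axis=0, keepdims=True)    # normalize over z for each (d,w)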

Estimating the Parameters

•  Nobody said the generation of P(Z|D,W) must be based on the same parameter vector as the one we’re looking for!

•  Say we obtain P(Z|D,W) based on randomly generated parameters θn:

      P_θn(z|d,w) = P_θn(z) P_θn(d|z) P_θn(w|z) / Σ_z' P_θn(z') P_θn(d|z') P_θn(w|z')

•  We get a function of the parameters:

      Q(θ) = Σ_{d,w} n(d,w) Σ_z P_θn(z|d,w) log[ P(z) P(d|z) P(w|z) ]

Estimating the Parameters

•  The resulting function, Q(θ), is the conditional expectation of the complete data log likelihood with respect to the distribution P(Z|D,W).

•  It turns out that if we find the parameters that maximize Q we get a better estimate of the parameters!

•  Expressions for the parameters can be had by setting the partial derivatives with respect to the parameters to zero and solving, using Lagrange multipliers to enforce the normalization constraints.

Estimating the Parameters

•  E-step (misnamed):

      P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')

•  M-step:

      P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
      P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
      P(z)   ∝ Σ_{d,w} n(d,w) P(z|d,w)
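A minimal sketch of the M-step re-estimation, reusing the hypothetical posterior() sketch above; post has shape (K, M, N) and X is the M×N count matrix:

    import numpy as np

    def m_step(X, post):
        weighted = X[None, :, :] * post                    # n(d,w) P(z|d,w), shape (K, M, N)
        p_d_given_z = weighted.sum(axis=2)                 # sum over words -> (K, M)
        p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True)
        p_w_given_z = weighted.sum(axis=1)                 # sum over documents -> (K, N)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))                    # total mass per topic
        p_z /= p_z.sum()
        return p_z, p_d_given_z, p_w_given_z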

Estimating the Parameters

•  Concretely, we generate (randomly)

      θ1 = {Pθ1(Z); Pθ1(D|Z); Pθ1(W|Z)}.

•  Compute the posterior Pθ1(Z|W,D).

•  Compute new parameters θ2.

•  Repeat until “convergence”, say until the log likelihood stops changing a lot, or until boredom, or some N iterations (a full loop is sketched below).

•  For stability, average over multiple starts, varying numbers of topics.
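A minimal end-to-end sketch of the procedure, tying together the hypothetical posterior(), m_step() and log_likelihood() helpers sketched earlier on a made-up count matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(8, 12)).astype(float)     # toy 8x12 document-term matrix
    M, N = X.shape
    K = 3                                                   # number of topics

    # Random initial parameters theta_1, each distribution normalized
    p_z = np.full(K, 1.0 / K)
    p_d_given_z = rng.random((K, M)); p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True)
    p_w_given_z = rng.random((K, N)); p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)

    for it in range(50):                                    # or until the log likelihood stabilizes
        post = posterior(p_z, p_d_given_z, p_w_given_z)     # E-step
        p_z, p_d_given_z, p_w_given_z = m_step(X, post)     # M-step
        print(it, log_likelihood(X, p_z, p_d_given_z, p_w_given_z))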

Folding In

•  When a new document comes along, we want to estimate the posterior of the topics for the document.

   –  What is it about? I.e. what is the distribution over topics of the new document?

•  Perform a “little EM” (sketched below):

   –  E-step: compute P(Z|W, Dnew)

   –  M-step: compute P(Z|Dnew), keeping all other parameters unchanged.

   –  Converges very fast, in about five iterations.

   –  Overtly discriminative! The true colors of the method emerge.
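A minimal sketch of the “little EM”, keeping the trained p_w_given_z fixed and fitting only the new document’s topic mixture; x_new is a made-up length-N count vector and the names follow the earlier sketches:

    import numpy as np

    def fold_in(x_new, p_w_given_z, iters=5):
        x_new = np.asarray(x_new, dtype=float)
        K = p_w_given_z.shape[0]
        p_z_given_dnew = np.full(K, 1.0 / K)               # start from a uniform mixture
        for _ in range(iters):
            # E-step: P(z | w, d_new) proportional to P(z|d_new) P(w|z), shape (K, N)
            post = p_z_given_dnew[:, None] * p_w_given_z
            post /= post.sum(axis=0, keepdims=True)
            # M-step: update only the new document's topic distribution
            p_z_given_dnew = (x_new[None, :] * post).sum(axis=1)
            p_z_given_dnew /= p_z_given_dnew.sum()
        return p_z_given_dnew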

Problems with PLSA

•  Easily a huge number of parameters.

   –  Leads to unstable estimation (local maxima).

   –  Computationally intractable because of huge matrices.

   –  Modeling the documents directly can be a problem.

      •  What if the collection has millions of documents?

•  Not properly generative (is this a problem?)

Examples of Applications

•  Information Retrieval: compare topic distributions for documents and queries using a similarity measure like relative entropy (see the sketch below).

•  Collaborative Filtering (Hofmann, 2002) using Gaussian PLSA.

•  Topic segmentation in texts, by looking for spikes in the distances between topic distributions for neighbouring text blocks.
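A minimal sketch of the relative-entropy comparison mentioned above; the two topic distributions are made-up examples:

    import numpy as np

    def kl(p, q, eps=1e-12):
        # Relative entropy D(p || q) between two discrete distributions
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    doc_topics   = [0.6, 0.3, 0.1]      # P(Z|D) for a document
    query_topics = [0.5, 0.4, 0.1]      # P(Z|Q) folded in for a query
    print(kl(query_topics, doc_topics))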

