# Introduction to Probabilistic Latent Semantic Analysis

As presented to the NYC Predictive Analytics Meetup on June 10, 2010 by John Stewart, VP of Research and Analytics at Wireless Generation and a doctoral student at the CUNY Graduate Center focusing on applications of computational linguistics. John holds a BA from Yale.

### Introduction to Probabilistic Latent Semantic Analysis

1. Introduction to Probabilistic Latent Semantic Analysis. NYC Predictive Analytics Meetup, June 10, 2010.
2. PLSA
   - A type of latent variable model with observed count data and nominal latent variable(s).
   - Despite the adjective "semantic" in the acronym, the method is not inherently about meaning.
     - Not any more than, say, its cousin Latent Class Analysis.
   - Rather, the name must be read as P + LS(A|I), marking the genealogy of PLSA as a probabilistic re-cast of Latent Semantic Analysis/Indexing.
3. LSA
   - Factorization of the data matrix into orthogonal matrices to form bases of a (semantic) vector space.
   - Reduction of the original matrix to lower rank.
   - LSA for text complexity: cosine similarity between paragraphs.
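The slide's own formulas did not survive into the transcript; in the standard LSA formulation they are the singular value decomposition of the document-term matrix X and its rank-k truncation:

$$ X = U \Sigma V^{\top}, \qquad U^{\top}U = I,\; V^{\top}V = I, $$

$$ X \approx X_k = U_k \Sigma_k V_k^{\top}, $$

where Σ is the diagonal matrix of singular values and k is the number of retained "semantic" dimensions.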
4. Problems with LSA
   - Non-probabilistic.
   - Fails to handle polysemy.
     - Polysemy called "noise" in the LSA literature.
   - Shown (by Hofmann) to underperform compared to PLSA on an IR task.
5. Probabilities: Why?
   - Probabilistic systems allow for the evaluation of propositions under conditions of uncertainty. Probabilistic semantics.
   - Probabilistic systems provide a uniform mechanism for integrating and reasoning over heterogeneous information.
     - In PLSA, semantic dimensions are represented by unigram language models, more transparent than eigenvectors.
     - The latent variable structure allows for subtopics (hierarchical PLSA).
   - "If the weather is sunny tomorrow and I'm not tired we will go to the beach"
     - p(beach) = p(sunny & ~tired) = p(sunny)(1 - p(tired))
6. A Generative Model?
   - Let X be a random vector with components {X1, X2, …, Xn} random variables.
   - Each realization of X is assigned to a class, a value of a random variable Y.
   - A generative model tells a story about how the Xs came about: "once upon a time, a Y was selected, then Xs were created out of that Y."
   - A discriminative model strives to identify, as unambiguously as possible, the Y value for some given X.
7. A Generative Model?
   - A discriminative model estimates P(Y|X) directly.
   - A generative model estimates P(X|Y) and P(Y).
     - The predictive direction is then computed via Bayesian inversion, where P(X) is obtained by conditioning on Y:
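The formulas the slide points to are the usual Bayesian inversion and marginalization:

$$ P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}, \qquad P(X) = \sum_{y} P(X \mid Y = y)\, P(Y = y). $$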
8. A Generative Model?
   - A classic generative/discriminative pair: Naïve Bayes vs. Logistic Regression.
   - Naïve Bayes assumes that the Xi are conditionally independent given Y, so it estimates P(Xi | Y).
   - Logistic regression makes other assumptions, e.g. linearity of the independent variables with the logit of the dependent, independence of errors, but handles correlated predictors (up to perfect collinearity).
9. A Generative Model?
   - Generative models have richer probabilistic semantics.
     - Functions run both ways.
     - Assign distributions to the "independent" variables, even previously unseen realizations.
   - Ng and Jordan (2002) show that logistic regression has higher asymptotic accuracy but converges more slowly, suggesting a trade-off between accuracy and variance.
   - Overall trade-off between accuracy and usefulness.
10. A Generative Model? (two graphical models)
   - Start with document: D → Z → W, parameterized by P(D), P(Z|D), P(W|Z).
   - Start with topic: Z → D and Z → W, parameterized by P(Z), P(D|Z), P(W|Z).
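Written out, the two generative stories are two factorizations of the same joint distribution:

$$ P(d, w) = P(d) \sum_{z} P(z \mid d)\, P(w \mid z) \qquad \text{(start with document)}, $$

$$ P(d, w) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z) \qquad \text{(start with topic)}. $$

They agree because P(d) P(z|d) = P(d, z) = P(z) P(d|z), which is the Bayesian inversion referred to on the next slide.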
11. A Generative Model?
   - The observed data are cells of the document-term matrix.
     - We generate (doc, word) pairs.
     - Random variables D, W and Z as sources of objects.
   - Either:
     - Draw a document, draw a topic from the document, draw a word from the topic.
     - Draw a topic, draw a document from the topic, draw a word from the topic.
   - The two models are statistically equivalent.
     - Will generate identical likelihoods when fit.
     - Proof by Bayesian inversion.
   - In any case D and W are conditionally independent given Z.
12. A Generative Model?
13. A Generative Model?
   - But what is a Document here?
     - Just a label! There are no attributes associated with documents.
     - P(D|Z) relates topics to labels.
   - A previously unseen document is just a new label.
   - Therefore PLSA isn't generative in an interesting way, as it cannot handle previously unseen inputs in a generative manner.
     - Though the P(Z) distribution may still be of interest.
14. Estimating the Parameters
   - Θ = {P(Z); P(D|Z); P(W|Z)}
   - All distributions refer to the latent variable Z, so they cannot be estimated directly from the data.
   - How do we know when we have the right parameters?
     - When we have the θ that most closely generates the data, i.e. the document-term matrix.
15. Estimating the Parameters
   - The joint P(D,W) generates the observed document-term matrix.
   - The parameter vector θ yields the joint P(D,W).
   - We want the θ that maximizes the probability of the observed data.
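With n(d, w) the count in cell (d, w) of the document-term matrix, the objective is the log-likelihood of the observed counts under the model:

$$ \mathcal{L}(\theta) = \sum_{d}\sum_{w} n(d, w)\, \log P(d, w;\theta), \qquad P(d, w;\theta) = \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z). $$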
16. Estimating the Parameters
   - For the multinomial distribution:
   - Let X be the M×N document-term matrix.
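The multinomial formula on the slide was not captured; the likelihood it presumably refers to treats the observed (doc, word) tokens as draws from the cell probabilities P(d, w):

$$ P(X \mid \theta) = \frac{R!}{\prod_{d,w} n(d,w)!}\ \prod_{d,w} P(d, w;\theta)^{\,n(d,w)}, \qquad R = \sum_{d,w} n(d,w). $$

Since the combinatorial factor does not depend on θ, maximizing this is the same as maximizing the log-likelihood above.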
17. Estimating the Parameters
   - Imagine we knew the complete data matrix X′ = M×N×K, where the counts for topics were overt. Then:
   - Callouts on the slide's formula: the unseen counts (new and interesting) must sum to 1 for a given d, w; the remaining factors are the usual parameters θ.
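The formula itself was not captured; in the usual notation, the complete-data log-likelihood with overt topic counts n(d, w, z) is

$$ \mathcal{L}_c(\theta) = \sum_{d,w,z} n(d, w, z)\, \log \big[ P(z)\, P(d \mid z)\, P(w \mid z) \big], $$

where the bracketed product is "the usual parameters θ" and the hidden counts are supplied by the factorization on the next slide.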
18. Estimating the Parameters
   - We can factorize the counts in terms of the observed counts and a hidden distribution:
   - Let's give the hidden distribution its name: P(Z|D,W), the posterior distribution of Z with respect to D, W.
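The factorization splits each hidden count into the observed cell count times a distribution over topics that sums to one for each (d, w):

$$ n(d, w, z) = n(d, w)\, P(z \mid d, w), \qquad \sum_{z} P(z \mid d, w) = 1. $$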
19. Estimating the Parameters
   - P(Z|D,W) can be obtained from the parameters via Bayes and our core model assumption of conditional independence:
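By Bayes' rule and the conditional independence of D and W given Z:

$$ P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}. $$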
20. Estimating the Parameters
   - Nobody said the generation of P(Z|D,W) must be based on the same parameter vector as the one we're looking for!
   - Say we obtain P(Z|D,W) based on randomly generated parameters θn:
   - We get a function of the parameters:
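Plugging the posterior computed under θn into the complete-data log-likelihood (via the expected counts n(d, w) Pθn(z|d, w)) gives a function of the free parameters θ:

$$ Q(\theta;\theta_n) = \sum_{d,w} n(d, w) \sum_{z} P_{\theta_n}(z \mid d, w)\, \log \big[ P(z)\, P(d \mid z)\, P(w \mid z) \big]. $$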
21. Estimating the Parameters
   - The resulting function, Q(θ), is the conditional expectation of the complete data likelihood with respect to the distribution P(Z|D,W).
   - It turns out that if we find the parameters that maximize Q we get a better estimate of the parameters!
   - Expressions for the parameters can be had by setting the partial derivatives with respect to the parameters to zero and solving, using Lagrange multipliers for the normalization constraints.
22. Estimating the Parameters
   - E-step (misnamed):
   - M-step:
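The update formulas themselves were not captured; the standard PLSA updates are, for the E-step,

$$ P(z \mid d, w) = \frac{P(z)\, P(d \mid z)\, P(w \mid z)}{\sum_{z'} P(z')\, P(d \mid z')\, P(w \mid z')}, $$

and for the M-step, each re-normalized to sum to one,

$$ P(w \mid z) \propto \sum_{d} n(d, w)\, P(z \mid d, w), \quad P(d \mid z) \propto \sum_{w} n(d, w)\, P(z \mid d, w), \quad P(z) \propto \sum_{d,w} n(d, w)\, P(z \mid d, w). $$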
23. Estimating the Parameters
   - Concretely, we generate (randomly) θ1 = {Pθ1(Z); Pθ1(D|Z); Pθ1(W|Z)}.
   - Compute the posterior Pθ1(Z|W,D).
   - Compute new parameters θ2.
   - Repeat until "convergence", say until the log likelihood stops changing a lot, or until boredom, or some N iterations.
   - For stability, average over multiple starts, varying numbers of topics.
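A minimal numpy sketch of this loop, following the steps on the slide (random θ1, posterior, new θ, repeat until the log-likelihood stops moving). The function name `plsa_em` is illustrative, not from the talk, and the code materializes the full D×W×K posterior, so it only suits toy-sized matrices:

```python
import numpy as np

def plsa_em(X, n_topics, n_iter=100, tol=1e-4, seed=0):
    """Fit PLSA by EM on a document-term count matrix X (n_docs x n_words).

    Returns (P(z), P(d|z), P(w|z)) with shapes (K,), (D, K), (W, K)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_docs, n_words = X.shape

    # theta_1: random, normalized starting parameters.
    p_z = rng.random(n_topics); p_z /= p_z.sum()
    p_d_z = rng.random((n_docs, n_topics)); p_d_z /= p_d_z.sum(axis=0)
    p_w_z = rng.random((n_words, n_topics)); p_w_z /= p_w_z.sum(axis=0)

    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w) for every cell of the matrix.
        joint = p_z[None, None, :] * p_d_z[:, None, :] * p_w_z[None, :, :]
        p_dw = joint.sum(axis=2)                    # P(d,w) under current theta
        posterior = joint / p_dw[:, :, None]

        # Log-likelihood of the observed counts, used as the stopping rule.
        ll = np.sum(X * np.log(p_dw))
        if ll - prev_ll < tol:
            break
        prev_ll = ll

        # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w).
        expected = X[:, :, None] * posterior        # shape (D, W, K)
        p_w_z = expected.sum(axis=0); p_w_z /= p_w_z.sum(axis=0)
        p_d_z = expected.sum(axis=1); p_d_z /= p_d_z.sum(axis=0)
        p_z = expected.sum(axis=(0, 1)); p_z /= p_z.sum()

    return p_z, p_d_z, p_w_z
```

In practice one would average over multiple random restarts, as the slide suggests, since EM only finds a local maximum.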
24. Folding In
   - When a new document comes along, we want to estimate the posterior of the topics for the document.
     - What is it about? I.e. what is the distribution over topics of the new document?
   - Perform a "little EM":
     - E-step: compute P(Z|W, Dnew).
     - M-step: compute P(Z|Dnew), keeping all other parameters unchanged.
     - Converges very fast, five iterations?
     - Overtly discriminative! The true colors of the method emerge.
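A sketch of this "little EM" under the same assumptions, reusing the fitted `p_w_z` from the sketch above; `fold_in` and `x_new` are illustrative names:

```python
import numpy as np

def fold_in(x_new, p_w_z, n_topics, n_iter=5, seed=0):
    """Estimate P(z | d_new) from a new document's word-count vector x_new
    (length n_words), keeping P(w|z) fixed."""
    rng = np.random.default_rng(seed)
    p_z_dnew = rng.random(n_topics)
    p_z_dnew /= p_z_dnew.sum()

    for _ in range(n_iter):
        # E-step: P(z | d_new, w) proportional to P(z | d_new) * P(w | z).
        post = p_z_dnew[None, :] * p_w_z            # shape (n_words, n_topics)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: only P(z | d_new) is re-estimated; everything else stays put.
        p_z_dnew = (np.asarray(x_new, dtype=float)[:, None] * post).sum(axis=0)
        p_z_dnew /= p_z_dnew.sum()

    return p_z_dnew
```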
25. Problems with PLSA
   - Easily a huge number of parameters.
     - Leads to unstable estimation (local maxima).
     - Computationally intractable because of huge matrices.
     - Modeling the documents directly can be a problem.
       - What if the collection has millions of documents?
   - Not properly generative (is this a problem?)
26. Examples of Applications
   - Information Retrieval: compare topic distributions for documents and queries using a similarity measure like relative entropy.
   - Collaborative Filtering (Hofmann, 2002) using Gaussian PLSA.
   - Topic segmentation in texts, by looking for spikes in the distances between topic distributions for neighbouring text blocks.
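For the IR use case, a small sketch of relative entropy between two topic distributions, e.g. a folded-in query's P(z | query) against each document's P(z | d); function names are illustrative and assume numpy:

```python
import numpy as np

def relative_entropy(p, q, eps=1e-12):
    """KL(p || q) between two topic distributions (smoothed to avoid log(0))."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def symmetric_kl(p, q):
    """Symmetrized variant sometimes used for ranking."""
    return 0.5 * (relative_entropy(p, q) + relative_entropy(q, p))
```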