Introduction to Probabilistic Latent Semantic Analysis

NYP Predictive Analytics Meetup

June 10, 2010

PLSA

•  A type of latent variable model with observed count data and nominal latent variable(s).

•  Despite the adjective ‘semantic’ in the acronym, the method is not inherently about meaning.

   –  Not any more than, say, its cousin Latent Class Analysis.

•  Rather, the name must be read as P + LS(A|I), marking the genealogy of PLSA as a probabilistic re-cast of Latent Semantic Analysis/Indexing.

LSA

•  Factorization of the data matrix into orthogonal matrices to form bases of the (semantic) vector space:

      X = U Σ V^T,   with U and V orthogonal

•  Reduction of the original matrix to lower rank, keeping the k largest singular values:

      X_k = U_k Σ_k V_k^T

•  LSA for text complexity: cosine similarity between paragraphs (see the sketch below).
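A minimal sketch of the factorization and rank reduction above, assuming NumPy; the toy count matrix X, the rank k, and all variable names are illustrative only:

    import numpy as np

    # Toy document-term count matrix (rows = documents, columns = terms)
    X = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [0, 1, 2, 2]], dtype=float)

    k = 2                                          # number of latent dimensions
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k reconstruction

    # Documents in the reduced (semantic) space, compared by cosine similarity
    docs_k = U[:, :k] * s[:k]
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(docs_k[0], docs_k[1]))            # similarity of documents 0 and 1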

Problems with LSA

•  Non-probabilistic.

•  Fails to handle polysemy.

   –  Polysemy is called “noise” in the LSA literature.

•  Shown (by Hofmann) to underperform compared to PLSA on an IR task.

Probabilities: Why?

•  Probabilistic systems allow for the evaluation of propositions under conditions of uncertainty: probabilistic semantics.

•  Probabilistic systems provide a uniform mechanism for integrating and reasoning over heterogeneous information.

   –  In PLSA semantic dimensions are represented by unigram language models, more transparent than eigenvectors.

   –  The latent variable structure allows for subtopics (hierarchical PLSA).

•  “If the weather is sunny tomorrow and I’m not tired we will go to the beach”

   –  p(beach) = p(sunny & ~tired) = p(sunny)(1 - p(tired))
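For instance, with hypothetical values p(sunny) = 0.7 and p(tired) = 0.2, the rule gives:

      p(beach) = 0.7 × (1 - 0.2) = 0.56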

A Generative Model?

•  Let X be a random vector with components {X1, X2, … , Xn}, each a random variable.

•  Each realization of X is assigned to a class, a value of a random variable Y.

•  A generative model tells a story about how the Xs came about: “once upon a time, a Y was selected, then Xs were created out of that Y”.

•  A discriminative model strives to identify, as unambiguously as possible, the Y value for some given X.

A Generative Model?

•  A discriminative model estimates P(Y|X) directly.

•  A generative model estimates P(X|Y) and P(Y).

   –  The predictive direction is then computed via Bayesian inversion:

         P(Y|X) = P(X|Y) P(Y) / P(X)

      where P(X) is obtained by conditioning on Y:

         P(X) = Σ_y P(X|Y=y) P(Y=y)
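A minimal numeric sketch of the inversion, with made-up tables for a binary Y and a binary X (NumPy assumed):

    import numpy as np

    p_y = np.array([0.6, 0.4])              # P(Y)
    p_x_given_y = np.array([[0.9, 0.1],     # P(X|Y=0)
                            [0.3, 0.7]])    # P(X|Y=1)

    # P(X) by conditioning on Y, then P(Y|X) by Bayes
    p_x = p_y @ p_x_given_y                            # shape (2,)
    p_y_given_x = (p_x_given_y * p_y[:, None]) / p_x   # rows: Y, columns: X

    print(p_y_given_x[:, 1])                # P(Y | X=1)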


      
   


A Generative Model?

•  A classic generative/discriminative pair: Naïve Bayes vs Logistic Regression.

•  Naïve Bayes assumes that the Xi are conditionally independent given Y, so it estimates P(Xi | Y).

•  Logistic regression makes other assumptions, e.g. linearity of the independent variables with the logit of the dependent, and independence of errors, but it handles correlated predictors (up to perfect collinearity).
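A minimal sketch contrasting the pair on synthetic data, assuming scikit-learn is available; the dataset and settings are illustrative only:

    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    gen = GaussianNB().fit(X, y)                         # generative: models P(X|Y) and P(Y)
    disc = LogisticRegression(max_iter=1000).fit(X, y)   # discriminative: models P(Y|X)

    print(gen.score(X, y), disc.score(X, y))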

A Generative Model?

•  Generative models have richer probabilistic semantics.

   –  Functions run both ways.

   –  Assign distributions to the “independent” variables, even previously unseen realizations.

•  Ng and Jordan (2002) show that logistic regression has higher asymptotic accuracy, but converges more slowly, suggesting a trade-off between accuracy and variance.

•  Overall, a trade-off between accuracy and usefulness.

A Generative Model?

•  Start with document: D → Z → W, with P(D), P(Z|D), P(W|Z).

•  Start with topic: Z → D and Z → W, with P(Z), P(D|Z), P(W|Z).

A Generative Model?

•  The observed data are the cells of the document-term matrix.

   –  We generate (doc, word) pairs.

   –  Random variables D, W and Z as sources of objects.

•  Either (see the sampling sketch below):

   –  Draw a document, draw a topic from the document, draw a word from the topic.

   –  Draw a topic, draw a document from the topic, draw a word from the topic.

•  The two models are statistically equivalent.

   –  Will generate identical likelihoods when fit.

   –  Proof by Bayesian inversion.

•  In any case D and W are conditionally independent given Z.
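A minimal sampling sketch of the second (symmetric) story, with made-up toy distributions; integer indices stand in for documents, topics and words, and all names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    p_z = np.array([0.5, 0.5])                      # P(Z)
    p_d_given_z = np.array([[0.7, 0.2, 0.1],        # P(D|Z=0)
                            [0.1, 0.3, 0.6]])       # P(D|Z=1)
    p_w_given_z = np.array([[0.6, 0.3, 0.1, 0.0],   # P(W|Z=0)
                            [0.0, 0.1, 0.4, 0.5]])  # P(W|Z=1)

    def sample_pair():
        # Draw a topic, then a document and a word from that topic
        z = rng.choice(2, p=p_z)
        d = rng.choice(3, p=p_d_given_z[z])
        w = rng.choice(4, p=p_w_given_z[z])
        return d, w

    pairs = [sample_pair() for _ in range(10000)]

The asymmetric story (document first, then a topic via P(Z|D), then a word) induces the same joint P(D,W), which is what the equivalence claim above amounts to.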

A Generative Model?

•  But what is a Document here?

   –  Just a label! There are no attributes associated with documents.

   –  P(D|Z) relates topics to labels.

•  A previously unseen document is just a new label.

•  Therefore PLSA isn’t generative in an interesting way, as it cannot handle previously unseen inputs in a generative manner.

   –  Though the P(Z) distribution may still be of interest.

Estimating the Parameters

•  θ = {P(Z); P(D|Z); P(W|Z)}

•  All distributions refer to the latent variable Z, so they cannot be estimated directly from the data.

•  How do we know when we have the right parameters?

   –  When we have the θ that most closely generates the data, i.e. the document-term matrix.


Estimating the Parameters

      P(D,W) = Σ_z P(z) P(D|z) P(W|z)

•  The joint P(D,W) generates the observed document-term matrix.

•  The parameter vector θ yields the joint P(D,W).

•  We want the θ that maximizes the probability of the observed data.

Estimating the Parameters

•  For the multinomial distribution, the log likelihood of the observed counts is

      L(θ) = Σ_{d,w} n(d,w) log P(d,w)

•  Let X be the M×N document-term matrix, with cells n(d,w).
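A minimal sketch of that log likelihood for a given parameter set, assuming NumPy; the array names (p_z, p_d_given_z, p_w_given_z, with shapes K, K×M, K×N) are conventions of this sketch, not part of the slides:

    import numpy as np

    def log_likelihood(X, p_z, p_d_given_z, p_w_given_z):
        # Joint P(D,W) = sum_z P(z) P(d|z) P(w|z), shape (M, N)
        p_dw = np.einsum('k,kd,kw->dw', p_z, p_d_given_z, p_w_given_z)
        # Multinomial log likelihood: sum over cells of n(d,w) log P(d,w)
        return float(np.sum(X * np.log(p_dw + 1e-12)))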


Estimating the Parameters

•  Imagine we knew the X′ = M×N×K complete data matrix, where the counts n(d,w,z) for topics were overt. Then,

      L_c(θ) = Σ_{d,w,z} n(d,w,z) log[ P(z) P(d|z) P(w|z) ]

   –  New and interesting: the unseen counts n(d,w,z); for a given d,w their shares of n(d,w) must sum to 1.

   –  The usual parameters θ: P(z), P(d|z), P(w|z).

Estimating the Parameters

•  We can factorize the counts in terms of the observed counts and a hidden distribution:

      n(d,w,z) = n(d,w) P(z|d,w)

•  Let’s give the hidden distribution its name: P(Z|D,W), the posterior distribution of Z w.r.t. D,W.

Estimating the Parameters

•  P(Z|D,W) can be obtained from the parameters via Bayes and our core model assumption of conditional independence:

      P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')
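A minimal sketch of that posterior computation, using the same hypothetical array conventions as the log-likelihood sketch above:

    import numpy as np

    def posterior(p_z, p_d_given_z, p_w_given_z):
        # Unnormalized P(z) P(d|z) P(w|z), shape (K, M, N)
        joint = p_z[:, None, None] * p_d_given_z[:, :, None] * p_w_given_z[:, None, :]
        return joint / joint.sum(axis=0, keepdims=True)    # normalize over z for each (d,w)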

Estimating the Parameters

•  Nobody said the generation of P(Z|D,W) must be based on the same parameter vector as the one we’re looking for!

•  Say we obtain P(Z|D,W) based on randomly generated parameters θn:

      P_θn(z|d,w) = P_θn(z) P_θn(d|z) P_θn(w|z) / Σ_z' P_θn(z') P_θn(d|z') P_θn(w|z')

•  We get a function of the parameters:

      Q(θ) = Σ_{d,w} n(d,w) Σ_z P_θn(z|d,w) log[ P(z) P(d|z) P(w|z) ]

Estimating the Parameters

•  The resulting function, Q(θ), is the conditional expectation of the complete data log likelihood with respect to the distribution P(Z|D,W).

•  It turns out that if we find the parameters that maximize Q we get a better estimate of the parameters!

•  Expressions for the parameters can be had by setting the partial derivatives with respect to the parameters to zero and solving, using Lagrange multipliers to enforce the normalization constraints.

Estimating the Parameters

•  E-step (misnamed):

      P(z|d,w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')

•  M-step:

      P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
      P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
      P(z)   ∝ Σ_{d,w} n(d,w) P(z|d,w)
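A minimal sketch of the M-step re-estimation, reusing the hypothetical posterior() sketch above; post has shape (K, M, N) and X is the M×N count matrix:

    import numpy as np

    def m_step(X, post):
        weighted = X[None, :, :] * post                    # n(d,w) P(z|d,w), shape (K, M, N)
        p_d_given_z = weighted.sum(axis=2)                 # sum over words -> (K, M)
        p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True)
        p_w_given_z = weighted.sum(axis=1)                 # sum over documents -> (K, N)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2))                    # total mass per topic
        p_z /= p_z.sum()
        return p_z, p_d_given_z, p_w_given_z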

Estimating the Parameters

•  Concretely, we generate (randomly)

      θ1 = {Pθ1(Z); Pθ1(D|Z); Pθ1(W|Z)}.

•  Compute the posterior Pθ1(Z|W,D).

•  Compute new parameters θ2.

•  Repeat until “convergence”, say until the log likelihood stops changing a lot, or until boredom, or some N iterations (a full loop is sketched below).

•  For stability, average over multiple starts, varying numbers of topics.
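A minimal end-to-end sketch of the procedure, tying together the hypothetical posterior(), m_step() and log_likelihood() helpers sketched earlier on a made-up count matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 5, size=(8, 12)).astype(float)     # toy 8x12 document-term matrix
    M, N = X.shape
    K = 3                                                   # number of topics

    # Random initial parameters theta_1, each distribution normalized
    p_z = np.full(K, 1.0 / K)
    p_d_given_z = rng.random((K, M)); p_d_given_z /= p_d_given_z.sum(axis=1, keepdims=True)
    p_w_given_z = rng.random((K, N)); p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)

    for it in range(50):                                    # or until the log likelihood stabilizes
        post = posterior(p_z, p_d_given_z, p_w_given_z)     # E-step
        p_z, p_d_given_z, p_w_given_z = m_step(X, post)     # M-step
        print(it, log_likelihood(X, p_z, p_d_given_z, p_w_given_z))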

Folding In

•  When a new document comes along, we want to estimate the posterior of the topics for the document.

   –  What is it about? I.e. what is the distribution over topics of the new document?

•  Perform a “little EM” (sketched below):

   –  E-step: compute P(Z|W, Dnew)

   –  M-step: compute P(Z|Dnew), keeping all other parameters unchanged.

   –  Converges very fast, in about five iterations.

   –  Overtly discriminative! The true colors of the method emerge.
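A minimal sketch of the “little EM”, keeping the trained p_w_given_z fixed and fitting only the new document’s topic mixture; x_new is a made-up length-N count vector and the names follow the earlier sketches:

    import numpy as np

    def fold_in(x_new, p_w_given_z, iters=5):
        x_new = np.asarray(x_new, dtype=float)
        K = p_w_given_z.shape[0]
        p_z_given_dnew = np.full(K, 1.0 / K)               # start from a uniform mixture
        for _ in range(iters):
            # E-step: P(z | w, d_new) proportional to P(z|d_new) P(w|z), shape (K, N)
            post = p_z_given_dnew[:, None] * p_w_given_z
            post /= post.sum(axis=0, keepdims=True)
            # M-step: update only the new document's topic distribution
            p_z_given_dnew = (x_new[None, :] * post).sum(axis=1)
            p_z_given_dnew /= p_z_given_dnew.sum()
        return p_z_given_dnew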

Problems with PLSA

•  Easily a huge number of parameters.

   –  Leads to unstable estimation (local maxima).

   –  Computationally intractable because of huge matrices.

   –  Modeling the documents directly can be a problem.

      •  What if the collection has millions of documents?

•  Not properly generative (is this a problem?)

Examples of Applications

•  Information Retrieval: compare topic distributions for documents and queries using a similarity measure like relative entropy (see the sketch below).

•  Collaborative Filtering (Hofmann, 2002) using Gaussian PLSA.

•  Topic segmentation in texts, by looking for spikes in the distances between topic distributions for neighbouring text blocks.
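A minimal sketch of the relative-entropy comparison mentioned above; the two topic distributions are made-up examples:

    import numpy as np

    def kl(p, q, eps=1e-12):
        # Relative entropy D(p || q) between two discrete distributions
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    doc_topics   = [0.6, 0.3, 0.1]      # P(Z|D) for a document
    query_topics = [0.5, 0.4, 0.1]      # P(Z|Q) folded in for a query
    print(kl(query_topics, doc_topics))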

