Expectation Propagation
Theory and Application

Dong Guo
Research Workshop 2013, Hulu Internal
  
	
  
See more details in
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
  
	
  
	
  
Outline
•  Overview
•  Background
•  Theory
•  Applications
  
OVERVIEW	
  
Bayesian Paradigm
•  Infer the posterior distribution (Bayes' rule is restated below)

[Figure: prior + data → posterior → make decision]
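As a reminder of what "infer the posterior distribution" means here (standard Bayes' rule, stated for completeness rather than taken from the slides):

$$p(\theta \mid \mathcal D) = \frac{p(\mathcal D \mid \theta)\,p(\theta)}{p(\mathcal D)} \propto p(\mathcal D \mid \theta)\,p(\theta),$$

and decisions are then made by averaging predictions (or losses) over this posterior rather than by plugging in a single point estimate.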
  

Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.
  	
  
Bayesian inference methods
•  Exact inference
   –  Belief propagation
•  Approximate inference
   –  Stochastic (sampling)
   –  Deterministic
      •  Assumed density filtering
      •  Expectation propagation
      •  Variational Bayes
  
Message passing
•  A form of communication used in multiple domains of computer science
   –  Parallel computing (MPI)
   –  Object-oriented programming
   –  Inter-process communication
   –  Bayesian inference
•  A family of methods to infer posterior distributions
  
Expectation Propagation
•  Belongs to the message passing family
•  Approximate method (iteration is needed)
•  Very popular in Bayesian inference, especially for graphical models
  
Researchers
•  Thomas Minka
   –  EP was proposed in his PhD thesis
•  Kevin P. Murphy
   –  Machine Learning: A Probabilistic Perspective
  
BACKGROUND	
  
Background
•  (Truncated) Gaussian
•  Exponential family
•  Graphical model
•  Factor graph
•  Belief propagation
•  Moment matching
  
Gaussian and Truncated Gaussian
•  Gaussian operations are the basis for EP inference
   –  Gaussian +, ×, ÷ Gaussian
   –  Gaussian integral
•  The truncated Gaussian is used in many EP applications (a sketch of these operations follows)
•  See details here
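The Gaussian operations listed above can be written directly in terms of natural parameters (precision and precision-adjusted mean), and the truncated-Gaussian moments only need the standard normal pdf/cdf. A minimal 1-D sketch in Python (the function names and the 1-D restriction are my own choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def gaussian_multiply(m1, v1, m2, v2):
    """Product of two 1-D Gaussian densities, up to normalization.
    Work in natural parameters: precision tau = 1/v, precision-mean tau*m."""
    tau = 1.0 / v1 + 1.0 / v2
    tau_m = m1 / v1 + m2 / v2
    return tau_m / tau, 1.0 / tau          # mean, variance

def gaussian_divide(m1, v1, m2, v2):
    """Quotient N(m1, v1) / N(m2, v2); precisions subtract instead of add."""
    tau = 1.0 / v1 - 1.0 / v2
    tau_m = m1 / v1 - m2 / v2
    return tau_m / tau, 1.0 / tau

def truncated_gaussian_moments(m, v, a):
    """Mean and variance of N(m, v) truncated to the region x > a."""
    s = np.sqrt(v)
    alpha = (a - m) / s
    lam = norm.pdf(alpha) / (1.0 - norm.cdf(alpha))   # hazard function
    mean = m + s * lam
    var = v * (1.0 - lam * (lam - alpha))
    return mean, var

# Example: combine two noisy Gaussian beliefs about the same quantity
print(gaussian_multiply(0.0, 1.0, 2.0, 4.0))        # pulled toward the tighter belief
print(truncated_gaussian_moments(0.0, 1.0, 0.0))    # positive half of a standard normal
```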
  
Exponential family distribution
•  Very good summary in Wikipedia

$$q(\mathbf{z}) = h(\mathbf{z})\,g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{\mathsf T} u(\mathbf{z})\}$$

•  Sufficient statistics of the Gaussian distribution: (x, x²)
•  Typical distributions

Note: the 4 figures above are from Wikipedia.
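For concreteness, the univariate Gaussian written in this exponential-family form (standard algebra, not from the slides) has sufficient statistics (x, x²) and natural parameters determined by its mean and variance:

$$\mathcal N(x \mid \mu, \sigma^2) = h(x)\,g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{\mathsf T} u(x)\}, \quad u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad \boldsymbol\eta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \quad h(x) = \frac{1}{\sqrt{2\pi}}, \quad g(\boldsymbol\eta) = \sqrt{-2\eta_2}\,\exp\!\Big(\frac{\eta_1^2}{4\eta_2}\Big).$$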
  
Graphical Models
•  Directed graph (Bayesian Network)

[Figure: directed graph over x1, x2, x3, x4]

$$P(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)$$
  

•  Undirected graph (Conditional Random Field); an example factorization of both kinds follows

[Figure: undirected graph over x1, x2, x3, x4]
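As an illustration of the product-over-parents formula (the chain structure here is hypothetical, since the original figures are not reproduced):

$$P(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\,p(x_4 \mid x_3) \quad \text{for the directed chain } x_1 \to x_2 \to x_3 \to x_4,$$
$$P(\mathbf{x}) = \frac{1}{Z}\prod_{C} \psi_C(\mathbf{x}_C) \quad \text{for an undirected model, with potentials } \psi_C \text{ over cliques } C.$$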
  
Factor graph
•  Expresses the relations between variable nodes explicitly
•  The relation on an edge becomes a factor node
•  Hides the difference between BN and CRF during inference
•  Makes inference more intuitive

[Figure: a graph over x1, x2, x3, x4 and the corresponding factor graph with factor nodes fa, fc, ...]
  
BELIEF PROPAGATION
  
Belief Propagation Overview
•  Exact Bayesian method to infer marginal distributions (exact on tree-structured graphs)
   –  'sum-product' message passing
•  Key components
   –  Calculate the posterior distribution of a variable node
   –  Two kinds of messages
  
Posterior distribution of a variable node
•  Factor graph

$$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s), \quad \text{for any variable } x \text{ in the graph}$$

$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)} \Big[ \sum_{X_s} F_s(x, X_s) \Big] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x),$$

$$\text{in which } \mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s).$$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
Message: factor -> variable node
•  Factor graph

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{x_m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m),$$
$$\text{in which } \{x_1, \ldots, x_M\} \text{ is the set of variables on which the factor } f_s \text{ depends.}$$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Message: variable -> factor node
•  Factor graph

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Whole steps of BP
•  Steps to calculate the posterior distribution of a given variable node
   –  Step 1: construct the factor graph
   –  Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
   –  Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors
   –  Step 4: get the marginal distribution by multiplying all incoming messages (a small numerical sketch follows the note below)

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
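To make the four steps concrete, here is a minimal sum-product sketch on a hypothetical three-variable chain x1 - fa - x2 - fb - x3 with binary variables (the chain, factor tables, and names are illustrative, not from the slides); it computes the marginal of x1 and checks it against brute-force summation:

```python
import numpy as np

# Hypothetical pairwise factors on a chain x1 - fa - x2 - fb - x3 (binary variables)
fa = np.array([[1.0, 0.5],
               [0.5, 2.0]])          # fa[x1, x2]
fb = np.array([[1.5, 0.2],
               [0.3, 1.0]])          # fb[x2, x3]

# Step 2: x1 is the root; leaf variable x3 sends a constant message toward fb
msg_x3_to_fb = np.ones(2)

# Step 3: factor -> variable messages sum out all variables except the recipient
msg_fb_to_x2 = fb @ msg_x3_to_fb     # mu_{fb->x2}(x2) = sum_{x3} fb(x2, x3)
msg_x2_to_fa = msg_fb_to_x2          # x2 has no other neighboring factors
msg_fa_to_x1 = fa @ msg_x2_to_fa     # mu_{fa->x1}(x1) = sum_{x2} fa(x1, x2) mu_{x2->fa}(x2)

# Step 4: multiply all incoming messages at the root and normalize
p_x1 = msg_fa_to_x1 / msg_fa_to_x1.sum()

# Brute-force check: p(x1) is proportional to sum_{x2, x3} fa(x1, x2) fb(x2, x3)
joint = np.einsum('ij,jk->ijk', fa, fb)
p_x1_brute = joint.sum(axis=(1, 2))
p_x1_brute /= p_x1_brute.sum()

print(p_x1, p_x1_brute)              # the two marginals agree
```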
  
BP: example
•  Infer the marginal distribution of x_3
•  Infer the marginal distributions of all variables

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
  
Posterior is intractable sometimes
•  Example
   –  Infer the mean of a Gaussian distribution (the clutter problem)

$$p(x \mid \theta) = (1 - w)\,\mathcal N(x \mid \theta, I) + w\,\mathcal N(x \mid 0, aI)$$
$$p(\theta) = \mathcal N(\theta \mid 0, bI)$$

   –  Ad predictor

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Distribution Approximation

Approximate $p(x)$ with $q(x)$, which belongs to the exponential family, so that $q(x) = h(x)\,g(\eta)\exp\{\eta^{\mathsf T} u(x)\}$:

$$\mathrm{KL}(p \,\|\, q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx$$
$$= -\int p(x)\ln g(\eta)\,dx - \int p(x)\,\eta^{\mathsf T} u(x)\,dx + \text{const} = -\ln g(\eta) - \eta^{\mathsf T}\,\mathbb E_{p(x)}[u(x)] + \text{const},$$

where the const terms are independent of the natural parameter $\eta$.

Minimize $\mathrm{KL}(p \,\|\, q)$ by setting the gradient with respect to $\eta$ to zero:
$$-\nabla \ln g(\eta) = \mathbb E_{p(x)}[u(x)].$$
By leveraging formula (2.226) in PRML:
$$\mathbb E_{q(x)}[u(x)] = -\nabla \ln g(\eta) = \mathbb E_{p(x)}[u(x)].$$
Moment matching

It is called moment matching when $q(x)$ is a Gaussian distribution; then $u(x) = (x, x^2)^{\mathsf T}$, so

$$\int q(x)\,x\,dx = \int p(x)\,x\,dx, \quad \text{and} \quad \int q(x)\,x^2\,dx = \int p(x)\,x^2\,dx$$
$$\Rightarrow \text{mean}_{q(x)} = \int q(x)\,x\,dx = \int p(x)\,x\,dx = \text{mean}_{p(x)},$$
$$\text{variance}_{q(x)} = \int q(x)\,x^2\,dx - (\text{mean}_{q(x)})^2 = \int p(x)\,x^2\,dx - (\text{mean}_{p(x)})^2 = \text{variance}_{p(x)}$$

•  Moments of a distribution

$$k\text{'th moment: } M_k = \int_a^b x^k f(x)\,dx$$
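A quick numerical illustration of moment matching (the mixture weights and parameters below are made up for the example): approximate a two-component Gaussian mixture p(x) by a single Gaussian q(x) whose first two moments match those of p.

```python
import numpy as np

# Hypothetical 1-D Gaussian mixture p(x) = sum_i w_i N(x | mu_i, var_i)
w   = np.array([0.7, 0.3])
mu  = np.array([-1.0, 2.0])
var = np.array([0.5, 1.5])

# Moment matching: q(x) = N(x | m, v) with the same mean and variance as p(x)
m = np.sum(w * mu)                          # E_p[x]
second_moment = np.sum(w * (var + mu**2))   # E_p[x^2]
v = second_moment - m**2                    # Var_p[x]
print(f"matched Gaussian: mean={m:.3f}, variance={v:.3f}")

# Monte Carlo sanity check by sampling from the mixture
rng = np.random.default_rng(0)
comp = rng.choice(len(w), size=200_000, p=w)
samples = rng.normal(mu[comp], np.sqrt(var[comp]))
print(f"sampled  mixture: mean={samples.mean():.3f}, variance={samples.var():.3f}")
```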
EXPECTATION PROPAGATION
= Belief Propagation + Moment matching?
  
Key Idea
•  Approximate each factor with a Gaussian distribution
•  Approximate corresponding factor pairs one by one?
•  Approximate each factor in turn in the context of all remaining factors (proposed by Minka)

$$\text{Refine the factor } \tilde f_j(\theta) \text{ by ensuring that } q^{\text{new}}(\theta) \propto \tilde f_j(\theta)\,q^{\setminus j}(\theta) \text{ is close to } f_j(\theta)\,q^{\setminus j}(\theta), \quad \text{in which } q^{\setminus j}(\theta) = \frac{q(\theta)}{\tilde f_j(\theta)}.$$
EP: the detailed steps

1. Initialize all of the approximating factors $\tilde f_i(\theta)$.
2. Initialize the posterior approximation by setting $q(\theta) \propto \prod_i \tilde f_i(\theta)$.
3. Until convergence:
   (a) Choose a factor $\tilde f_j(\theta)$ to refine.
   (b) Remove $\tilde f_j(\theta)$ from the posterior by division: $q^{\setminus j}(\theta) = \dfrac{q(\theta)}{\tilde f_j(\theta)}$.
   (c) Get the new posterior $q^{\text{new}}(\theta)$ by setting its sufficient statistics equal to those of $\dfrac{f_j(\theta)\,q^{\setminus j}(\theta)}{Z_j}$, i.e. minimize $\mathrm{KL}\!\Big(\dfrac{f_j(\theta)\,q^{\setminus j}(\theta)}{Z_j} \,\Big\|\, q^{\text{new}}(\theta)\Big)$, in which $Z_j = \int f_j(\theta)\,q^{\setminus j}(\theta)\,d\theta$.
   (d) Get the refined factor $\tilde f_j(\theta) = K\,\dfrac{q^{\text{new}}(\theta)}{q^{\setminus j}(\theta)}$, where the constant $K$ can be fixed by matching zeroth moments ($K = Z_j$).
Example: the clutter problem
•  Infer the mean of a Gaussian distribution
•  Want to try MLE, but:

$$p(x \mid \theta) = (1 - w)\,\mathcal N(x \mid \theta, I) + w\,\mathcal N(x \mid 0, aI)$$
$$p(\theta) = \mathcal N(\theta \mid 0, bI)$$

•  Approximate with
$$q(\theta) = \mathcal N(\theta \mid m, vI), \quad \text{and each factor } \tilde f_n(\theta) = \mathcal N(\theta \mid m_n, v_n I)$$
   –  Approximate the Gaussian mixture using a Gaussian

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Example: the clutter problem (2)
•  Approximate a complex factor (e.g. a Gaussian mixture) with a Gaussian (a numerical sketch follows the note)

$f_n(\theta)$ in blue, $\tilde f_n(\theta)$ in red, and $q^{\setminus n}(\theta)$ in green.
Remember that the variance of $q^{\setminus n}(\theta)$ is usually very small, so $\tilde f_n(\theta)$ only needs to approximate $f_n(\theta)$ over a small range.

Note: the 2 figures above are from the book 'Pattern Recognition and Machine Learning'.
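To connect the algorithm steps with this example, here is a rough 1-D EP sketch for the clutter problem; instead of the analytic updates from PRML it performs the moment matching of step (c) by numerical integration on a grid, and the data, hyperparameters (w, a, b), and function names are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
w, a, b = 0.3, 10.0, 100.0                 # clutter weight, clutter variance, prior variance
theta_true, n = 2.0, 20
clutter = rng.random(n) < w
x = np.where(clutter, rng.normal(0.0, np.sqrt(a), n), rng.normal(theta_true, 1.0, n))

grid = np.linspace(-15, 15, 4001)          # grid for numerical moment matching
dx = grid[1] - grid[0]

def likelihood_factor(xn, theta):
    """One clutter-problem factor f_n(theta) = (1-w) N(x_n|theta,1) + w N(x_n|0,a)."""
    return (1 - w) * norm.pdf(xn, loc=theta, scale=1.0) + w * norm.pdf(xn, loc=0.0, scale=np.sqrt(a))

# Steps 1-2: approximating factors are Gaussians in natural-parameter form (tau, tau*m);
# initialize them to be flat (tau = 0), so q(theta) starts equal to the prior N(0, b).
tau_n, taum_n = np.zeros(n), np.zeros(n)
tau_q, taum_q = 1.0 / b, 0.0               # the prior factor is already Gaussian and never refined

for sweep in range(10):                    # step 3: iterate until (approximately) converged
    for j in range(n):
        # (b) remove factor j by dividing it out of q (subtract natural parameters)
        tau_cav, taum_cav = tau_q - tau_n[j], taum_q - taum_n[j]
        if tau_cav <= 0:                   # EP can produce an invalid cavity; skip such updates
            continue
        m_cav, v_cav = taum_cav / tau_cav, 1.0 / tau_cav
        # (c) moment-match q_new to f_j(theta) * cavity(theta) by numerical integration
        tilted = likelihood_factor(x[j], grid) * norm.pdf(grid, m_cav, np.sqrt(v_cav))
        Z = tilted.sum() * dx
        m_new = (grid * tilted).sum() * dx / Z
        v_new = (grid**2 * tilted).sum() * dx / Z - m_new**2
        tau_q, taum_q = 1.0 / v_new, m_new / v_new
        # (d) the refined factor is q_new / cavity (natural parameters subtract)
        tau_n[j], taum_n[j] = tau_q - tau_cav, taum_q - taum_cav

print(f"EP posterior over theta: mean={taum_q / tau_q:.3f}, variance={1.0 / tau_q:.3f}")
```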
  
Application: Bayesian CTR predictor for Bing
•  See the details here
   –  Inference step by step
   –  Making predictions
•  Some insights
   –  Variance of each feature increases after every exposure
   –  A sample with more features will have a bigger variance
•  Independence assumption for the features (a rough update sketch follows)
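As a companion to the linked write-up, here is a rough sketch of the kind of per-impression Gaussian weight update used in adPredictor-style models (probit link, one Gaussian per binary feature). The exact formulas and constants should be checked against the Bing CTR paper; the value of β, the feature encoding, and all names here are assumptions.

```python
import numpy as np
from scipy.stats import norm

def v(t):
    """Mean-correction function from truncated-Gaussian moment matching."""
    return norm.pdf(t) / norm.cdf(t)

def w(t):
    """Variance-correction function."""
    return v(t) * (v(t) + t)

def update(mu, sigma2, active, y, beta=1.0):
    """One adPredictor-style online update (sketch).

    mu, sigma2 : per-feature Gaussian weight means / variances (arrays)
    active     : indices of the binary features present in this impression
    y          : +1 for click, -1 for no click
    beta       : assumed noise scale of the probit likelihood
    """
    s = mu[active].sum()                         # mean of the total score
    var = beta**2 + sigma2[active].sum()         # variance of the (noisy) total score
    std = np.sqrt(var)
    t = y * s / std
    mu[active] += y * (sigma2[active] / std) * v(t)
    sigma2[active] *= 1.0 - (sigma2[active] / var) * w(t)
    return mu, sigma2

# Toy usage: 5 features, prior N(0, 1) on each weight
mu, sigma2 = np.zeros(5), np.ones(5)
mu, sigma2 = update(mu, sigma2, active=np.array([0, 2]), y=+1)
print(mu, sigma2)   # in this simplified update the active means move toward the click and their variances shrink
```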
  
Experimentation
•  The dataset is very inhomogeneous

•  Performance

   Model          AUC
   FTRL           0.638
   OWLQN          0.641
   Ad predictor   0.639

   –  Other metrics
•  Pros: speed, low parameter-tuning cost, online-learning support, interpretability, easy to add more factors
•  Cons: sparsity
•  Code
  
Application: XBOX skill rating system
•  See details in pp. 793-798 of Machine Learning: A Probabilistic Perspective

Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
  	
  
Apply to all Bayesian models
•  Infer.NET (Microsoft / Bishop)
   –  A framework for running Bayesian inference in graphical models
   –  Model-based machine learning
  	
  
References
•  Books
   –  Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
   –  Chapter 22 of Machine Learning: A Probabilistic Perspective
•  Papers
   –  A Family of Algorithms for Approximate Bayesian Inference
   –  From Belief Propagation to Expectation Propagation
   –  TrueSkill: A Bayesian Skill Rating System
   –  Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine
•  Roadmap for EP
  
