Expectation Propagation
Theory and Application

Dong Guo
Research Workshop 2013, Hulu Internal

See more details in:
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
Outline
• Overview
• Background
• Theory
• Applications
OVERVIEW
Bayesian Paradigm
• Infer the posterior distribution: Prior + Data → Posterior
• Make decisions based on the posterior

Note: the LDA figure is from Wikipedia, and the right figure is from the paper ‘Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine’.
Bayesian inference methods
• Exact inference
  – Belief propagation (exact on tree-structured graphs)
• Approximate inference
  – Stochastic (sampling)
  – Deterministic
    • Assumed density filtering
    • Expectation propagation
    • Variational Bayes
Message passing
• A form of communication used in multiple domains of computer science
  – Parallel computing (MPI)
  – Object-oriented programming
  – Inter-process communication
  – Bayesian inference
• A family of methods to infer posterior distributions
Expectation Propagation
• Belongs to the message passing family
• An approximate method (iteration is needed)
• Very popular in Bayesian inference, especially in graphical models
Researchers
• Thomas Minka
  – EP was proposed in his PhD thesis
• Kevin P. Murphy
  – Machine Learning: A Probabilistic Perspective
BACKGROUND
Background
• (Truncated) Gaussian
• Exponential family
• Graphical model
• Factor graph
• Belief propagation
• Moment matching
Gaussian and Truncated Gaussian
• Gaussian operations are the basis of EP inference
  – Gaussian + / × / ÷ Gaussian
  – Gaussian integrals
• The truncated Gaussian is used in many EP applications
• See details here
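Truncating a Gaussian and then summarizing it by its mean and variance is the core operation behind many EP applications. A minimal sketch in Python (parameter values are my own, for illustration): compute the moments of N(μ, σ²) restricted to x > 0 via the inverse Mills ratio, and cross-check against `scipy.stats.truncnorm`.

```python
import numpy as np
from scipy.stats import norm, truncnorm

mu, sigma = 0.5, 1.0                           # untruncated N(mu, sigma^2)
beta = (0.0 - mu) / sigma                      # truncation point x > 0, standardized
lam = norm.pdf(beta) / (1.0 - norm.cdf(beta))  # inverse Mills ratio
mean_t = mu + sigma * lam                      # mean shifts into the kept region
var_t = sigma**2 * (1.0 - lam * (lam - beta))  # variance always shrinks

# cross-check against scipy's truncnorm (bounds given in standard units)
rv = truncnorm(a=beta, b=np.inf, loc=mu, scale=sigma)
assert np.isclose(mean_t, rv.mean()) and np.isclose(var_t, rv.var())
```

Note that the truncated distribution is not Gaussian; reporting only (mean_t, var_t) is exactly the moment-matching step discussed later.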
  
Exponential family distribution
• Very good summary in Wikipedia

  q(z) = h(z) g(η) exp{η^T u(z)}

• Sufficient statistics of the Gaussian distribution: (x, x²)
• Typical distributions

Note: the above 4 figures are from Wikipedia.
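For the Gaussian, the exponential-family form can be checked numerically: with u(x) = (x, x²), the natural parameters are η = (μ/σ², −1/(2σ²)). A small sketch (the parameter values are my own):

```python
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.8
# Natural parameters of N(mu, sigma2) with sufficient statistics u(x) = (x, x^2)
eta1 = mu / sigma2
eta2 = -1.0 / (2.0 * sigma2)
# Log normalizer A(eta) (the -ln g(eta) term): mu^2/(2 sigma2) + 0.5 ln(2 pi sigma2)
A = mu**2 / (2 * sigma2) + 0.5 * np.log(2 * np.pi * sigma2)

x = 0.7
log_q = eta1 * x + eta2 * x**2 - A             # eta^T u(x) - A(eta)
assert np.isclose(log_q, norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)))
```

Working in natural parameters makes multiplying and dividing Gaussians (as EP does) a matter of adding and subtracting η.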
  
Graphical Models
• Directed graph (Bayesian Network)

  P(x) = ∏_{k=1}^{K} p(x_k | pa_k)

• Undirected graph (Conditional Random Field)

[Figures: example directed and undirected graphs over variables x1–x4]
Factor graph
• Expresses the relations between variable nodes explicitly
  – A relation on an edge → a factor node
• Hides the difference between BN and CRF during inference
• Makes inference more intuitive

[Figures: a graph over x1–x4 and its factor-graph form with explicit factor nodes, e.g. fa, fc]
BELIEF PROPAGATION
Belief Propagation Overview
• An exact Bayesian method to infer marginal distributions (on tree-structured factor graphs)
  – ‘sum-product’ message passing
• Key components
  – Calculating the posterior distribution of a variable node
  – Two kinds of messages
Posterior distribution of a variable node
• Factor graph

  p(X) = ∏_{s∈ne(x)} F_s(x, X_s), for any variable x in the graph

  p(x) = Σ_{X\x} p(X) = Σ_{X\x} ∏_{s∈ne(x)} F_s(x, X_s) = ∏_{s∈ne(x)} Σ_{X_s} F_s(x, X_s) = ∏_{s∈ne(x)} μ_{f_s→x}(x)

  in which μ_{f_s→x}(x) = Σ_{X_s} F_s(x, X_s)

Note: the figure is from the book ‘Pattern Recognition and Machine Learning’.
Message: factor → variable node
• Factor graph

  μ_{f_s→x}(x) = Σ_{x_1} … Σ_{x_M} f_s(x, x_1, …, x_M) ∏_{x_m∈ne(f_s)\x} μ_{x_m→f_s}(x_m),

  in which {x_1, …, x_M} is the set of variables on which the factor f_s depends

Note: the figure is from the book ‘Pattern Recognition and Machine Learning’.
Message: variable → factor node
• Factor graph

  μ_{x_m→f_s}(x_m) = ∏_{l∈ne(x_m)\f_s} μ_{f_l→x_m}(x_m)

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book ‘Pattern Recognition and Machine Learning’.
Whole steps of BP
• Steps to calculate the posterior distribution of a given variable node
  – Step 1: construct the factor graph
  – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
  – Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors
  – Step 4: get the marginal distribution by multiplying all incoming messages

Note: the figures are from the book ‘Pattern Recognition and Machine Learning’.
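The steps above can be sketched on a tiny binary chain x1 – f12 – x2 – f23 – x3, with x2 as the root; the factor tables here are made up for illustration, and the result is checked against brute-force marginalization.

```python
import numpy as np

# Chain x1 - f12 - x2 - f23 - x3, each variable binary; factor tables are arbitrary.
f12 = np.array([[1.0, 2.0], [3.0, 1.0]])     # f12[x1, x2]
f23 = np.array([[2.0, 1.0], [1.0, 4.0]])     # f23[x2, x3]

# Step 2: x2 is the root; leaves x1, x3 send uniform (all-ones) messages.
m_x1_to_f12 = np.ones(2)
m_x3_to_f23 = np.ones(2)
# Step 3: factor -> variable messages sum out the other variable (sum-product).
m_f12_to_x2 = (f12 * m_x1_to_f12[:, None]).sum(axis=0)
m_f23_to_x2 = (f23 * m_x3_to_f23[None, :]).sum(axis=1)
# Step 4: marginal at the root = product of incoming messages, normalized.
p_x2 = m_f12_to_x2 * m_f23_to_x2
p_x2 /= p_x2.sum()

# Brute-force check: p(x2) is proportional to sum over x1, x3 of f12[x1,x2] * f23[x2,x3]
brute = (f12[:, :, None] * f23[None, :, :]).sum(axis=(0, 2))
assert np.allclose(p_x2, brute / brute.sum())
```

On this chain BP reproduces the exact marginal, as it does on any tree.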
  
BP: example
• Infer the marginal distribution of x_3
• Infer the marginal distributions of all variables

Note: the figures are from the book ‘Pattern Recognition and Machine Learning’.
The posterior is sometimes intractable
• Example
  – Infer the mean of a Gaussian distribution:

    p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI)
    p(θ) = N(θ | 0, bI)

  – Ad predictor

Note: the figure is from the book ‘Pattern Recognition and Machine Learning’.
Distribution Approximation

Approximate p(x) with q(x), which belongs to the exponential family,
such that q(x) = h(x) g(η) exp{η^T u(x)}.

  KL(p ∥ q) = −∫ p(x) ln (q(x)/p(x)) dx = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx
            = −∫ p(x) ln g(η) dx − ∫ p(x) η^T u(x) dx + const
            = −ln g(η) − η^T E_{p(x)}[u(x)] + const,

where the const terms are independent of the natural parameter η.

Minimize KL(p ∥ q) by setting the gradient with respect to η to zero:
  ⇒ −∇ ln g(η) = E_{p(x)}[u(x)]
By leveraging formula (2.226) in PRML:
  ⇒ E_{q(x)}[u(x)] = −∇ ln g(η) = E_{p(x)}[u(x)]
Moment matching

It is called moment matching when q(x) is a Gaussian distribution; then u(x) = (x, x²)^T
  ⇒ ∫ q(x) x dx = ∫ p(x) x dx, and ∫ q(x) x² dx = ∫ p(x) x² dx
  ⇒ mean_{q(x)} = ∫ q(x) x dx = ∫ p(x) x dx = mean_{p(x)},
    variance_{q(x)} = ∫ q(x) x² dx − (mean_{q(x)})² = ∫ p(x) x² dx − (mean_{p(x)})² = variance_{p(x)}

• Moments of a distribution

  k-th moment: M_k = ∫_a^b x^k f(x) dx
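A quick numerical sketch of moment matching (the mixture parameters below are arbitrary): fit a single Gaussian q(x) to a two-component mixture by computing the first two moments on a grid, then compare them with the closed-form mixture moments.

```python
import numpy as np
from scipy.stats import norm

# Two-component Gaussian mixture p(x); parameters chosen arbitrarily.
w = np.array([0.7, 0.3])
mus = np.array([-1.0, 2.0])
sigmas = np.array([1.0, 0.5])

# First and second moments of p on a fine grid (tails beyond +-12 are negligible).
x = np.linspace(-12.0, 12.0, 240001)
dx = x[1] - x[0]
p = sum(wi * norm.pdf(x, m, s) for wi, m, s in zip(w, mus, sigmas))
m1 = (p * x).sum() * dx
m2 = (p * x**2).sum() * dx

# Moment-matched Gaussian q(x) = N(m1, m2 - m1^2)
mean_q, var_q = m1, m2 - m1**2

# Closed-form mixture moments for comparison
m1_exact = (w * mus).sum()
m2_exact = (w * (sigmas**2 + mus**2)).sum()
assert np.isclose(mean_q, m1_exact, atol=1e-6)
assert np.isclose(var_q, m2_exact - m1_exact**2, atol=1e-6)
```

The matched Gaussian is the KL(p ∥ q)-optimal member of the family, even though it is unimodal while p is not.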
EXPECTATION PROPAGATION
= Belief Propagation + Moment matching?
Key Idea
• Approximate each factor with a Gaussian distribution
• Approximate corresponding factor pairs one by one?
• Approximate each factor in turn in the context of all remaining factors (proposed by Minka)

  Refine factor f̃_j(θ) by ensuring that q^new(θ) ∝ f̃_j(θ) q^{\j}(θ) is close to f_j(θ) q^{\j}(θ),
  in which q^{\j}(θ) = q(θ) / f̃_j(θ).
EP: The detailed steps

1. Initialize all of the approximating factors f̃_i(θ).
2. Initialize the posterior approximation by setting q(θ) ∝ ∏_i f̃_i(θ).
3. Until convergence:
   (a) Choose a factor f̃_j(θ) to refine.
   (b) Remove f̃_j(θ) from the posterior by division: q^{\j}(θ) = q(θ) / f̃_j(θ).
   (c) Get the new posterior q^new(θ) by setting its sufficient statistics equal to those of f_j(θ) q^{\j}(θ) / z_j
       (i.e. minimize KL( f_j(θ) q^{\j}(θ) / z_j ∥ q^new(θ) )), in which z_j = ∫ f_j(θ) q^{\j}(θ) dθ.
   (d) Get the refined factor: f̃_j(θ) = K q^new(θ) / q^{\j}(θ).
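For Gaussian approximating factors, the divisions in steps (b) and (d) are just subtractions of natural parameters. A tiny sketch with made-up numbers:

```python
def gauss_div(m1, v1, m2, v2):
    """N(m1, v1) / N(m2, v2), up to a constant, is again Gaussian:
    subtract natural parameters (precision, precision-times-mean)."""
    tau = 1.0 / v1 - 1.0 / v2          # precisions subtract
    nu = m1 / v1 - m2 / v2             # precision-times-mean subtracts
    return nu / tau, 1.0 / tau         # back to (mean, variance)

# Removing a site N(1.0, 2.0) from a posterior N(0.5, 0.4) gives the cavity:
m_c, v_c = gauss_div(0.5, 0.4, 1.0, 2.0)
assert v_c > 0.4                        # removing information widens the Gaussian
```

Note the result can have negative precision if the site is sharper than the posterior; EP tolerates such "negative-variance" sites as long as the overall q stays proper.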
Example: The clutter problem
• Infer the mean of a Gaussian distribution
• Want to try MLE, but:

  p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI)
  p(θ) = N(θ | 0, bI)

• Approximate with
  q(θ) = N(θ | m, vI), and each factor f̃_n(θ) = N(θ | m_n, v_n I)
  – Approximate the Gaussian mixture using a Gaussian

Note: the figure is from the book ‘Pattern Recognition and Machine Learning’.
Example: The clutter problem (2)
• Approximate a complex factor (e.g. a Gaussian mixture) with a Gaussian

  f_n(θ) is shown in blue, f̃_n(θ) in red, and q^{\n}(θ) in green.
  Remember that the variance of q^{\n}(θ) is usually very small, so f̃_n(θ) only needs to approximate f_n(θ) over a small range.

Note: the above 2 figures are from the book ‘Pattern Recognition and Machine Learning’.
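Putting the pieces together, here is a sketch of EP for the 1-D clutter problem, following steps (a)–(d) and the closed-form tilted-distribution moments from PRML §10.7.1; the data-generation settings (w, a, b, θ, n, seed) are my own choices for illustration.

```python
import numpy as np
from scipy.stats import norm

# 1-D clutter problem: p(x|theta) = (1-w) N(x|theta,1) + w N(x|0,a), theta ~ N(0,b).
w, a, b, theta_true, n = 0.2, 10.0, 100.0, 2.0, 50
rng = np.random.default_rng(0)
is_clutter = rng.random(n) < w
x = np.where(is_clutter, rng.normal(0.0, np.sqrt(a), n),
                         rng.normal(theta_true, 1.0, n))

# Gaussian sites in natural form: tau = 1/v (precision), nu = m/v.
tau_site, nu_site = np.zeros(n), np.zeros(n)
tau0, nu0 = 1.0 / b, 0.0                      # prior term

for _ in range(20):                           # EP sweeps until (rough) convergence
    for i in range(n):                        # (a) pick a site to refine
        # (b) cavity q^{\i} = q / site_i, by subtracting natural parameters
        tau_c = tau0 + tau_site.sum() - tau_site[i]
        nu_c = nu0 + nu_site.sum() - nu_site[i]
        m_c, v_c = nu_c / tau_c, 1.0 / tau_c
        # (c) moments of the tilted distribution f_i(theta) q^{\i}(theta)
        #     (closed form from PRML section 10.7.1)
        Z = (1 - w) * norm.pdf(x[i], m_c, np.sqrt(v_c + 1)) \
            + w * norm.pdf(x[i], 0.0, np.sqrt(a))
        rho = 1 - w * norm.pdf(x[i], 0.0, np.sqrt(a)) / Z   # P(point is not clutter)
        m_new = m_c + rho * v_c / (v_c + 1) * (x[i] - m_c)
        v_new = v_c - rho * v_c**2 / (v_c + 1) \
            + rho * (1 - rho) * v_c**2 * (x[i] - m_c)**2 / (v_c + 1)**2
        # (d) refined site = new posterior / cavity, again in natural parameters
        tau_site[i] = 1.0 / v_new - tau_c
        nu_site[i] = m_new / v_new - nu_c

post_mean = (nu0 + nu_site.sum()) / (tau0 + tau_site.sum())
print(post_mean)
```

With mostly inlier data, the posterior mean lands near θ because ρ downweights the clutter points automatically.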
  
Application: Bayesian CTR predictor for Bing
• See the details here
  – Inference, step by step
  – Making predictions
• Some insights
  – The variance of each feature decreases after every exposure
  – A sample with more active features has a bigger total variance
• Independence assumption across features
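The update behind this predictor can be sketched as an online Bayesian probit step over Gaussian weight beliefs, a simplified, single-example version of the rule in the Web-Scale Bayesian CTR paper; the function and variable names here are mine.

```python
import numpy as np
from scipy.stats import norm

def ctr_update(mu, var, active, y, beta=1.0):
    """One online Bayesian-probit update (sketch). mu, var: per-feature Gaussian
    beliefs; active: indices of features present in this sample; y: label in
    {-1, +1}; beta: probit noise scale."""
    s2 = beta**2 + var[active].sum()           # total score variance
    s = np.sqrt(s2)
    t = y * mu[active].sum() / s
    v = norm.pdf(t) / norm.cdf(t)              # truncated-Gaussian correction
    w = v * (v + t)                            # 0 < w < 1
    mu[active] += y * (var[active] / s) * v    # means move toward the label
    var[active] *= 1.0 - (var[active] / s2) * w  # variances shrink every exposure
    return mu, var

mu, var = np.zeros(5), np.ones(5)
mu, var = ctr_update(mu, var, active=np.array([0, 2]), y=+1)
assert mu[0] > 0 and var[0] < 1.0              # belief moved up, uncertainty reduced
assert mu[1] == 0 and var[1] == 1.0            # inactive features untouched
```

The variance update illustrates the first insight above, and the score variance s² growing with the number of active features illustrates the second.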
  
Experimentation
• The dataset is very inhomogeneous

• Performance

  Model   FTRL    OWLQN   Ad predictor
  AUC     0.638   0.641   0.639

  – Other metrics
• Pros: speed, low parameter-tuning cost, online learning support, interpretability, easy to add more factors
• Cons: no sparsity
• Code
Application: XBOX skill rating system
• See details in pp. 793–798 of Machine Learning: A Probabilistic Perspective

Note: the figure is from the paper ‘TrueSkill: A Bayesian Skill Rating System’.
Apply to all Bayesian models
• Infer.NET (Microsoft / Bishop)
  – A framework for running Bayesian inference in graphical models
  – Model-based machine learning
References
• Books
  – Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
  – Chapter 22 of Machine Learning: A Probabilistic Perspective
• Papers
  – A family of algorithms for approximate Bayesian inference
  – From belief propagation to expectation propagation
  – TrueSkill: A Bayesian Skill Rating System
  – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine
• Roadmap for EP

More Related Content

What's hot

Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewYONG ZHENG
 
Aspect Extraction Performance With Common Pattern of Dependency Relation in ...
Aspect Extraction Performance With Common Pattern of  Dependency Relation in ...Aspect Extraction Performance With Common Pattern of  Dependency Relation in ...
Aspect Extraction Performance With Common Pattern of Dependency Relation in ...Nurfadhlina Mohd Sharef
 
Machine translation survey - vol1
Machine translation survey  - vol1Machine translation survey  - vol1
Machine translation survey - vol1gohyunwoong
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyAlexandros Karatzoglou
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)Marina Santini
 
Probabilistic approach to reliable localization
Probabilistic approach to reliable localizationProbabilistic approach to reliable localization
Probabilistic approach to reliable localizationNaokiAkai2
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]AAKANKSHA JAIN
 
บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)
บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)
บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)AJ. Tor วิศวกรรมแหล่งนํา้
 
บทที่ 2 การเคลื่อนที่แนวตรง
บทที่ 2 การเคลื่อนที่แนวตรงบทที่ 2 การเคลื่อนที่แนวตรง
บทที่ 2 การเคลื่อนที่แนวตรงThepsatri Rajabhat University
 
ของไหล 2
ของไหล 2ของไหล 2
ของไหล 2luanrit
 
รวมสูตรฟิสิกส์ ม.6
รวมสูตรฟิสิกส์ ม.6รวมสูตรฟิสิกส์ ม.6
รวมสูตรฟิสิกส์ ม.6Mu PPu
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)Hossein Abedi
 

What's hot (20)

Pumping lemma (1)
Pumping lemma (1)Pumping lemma (1)
Pumping lemma (1)
 
Session-Based Recommender Systems
Session-Based Recommender SystemsSession-Based Recommender Systems
Session-Based Recommender Systems
 
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
 
Context-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick ViewContext-aware Recommendation: A Quick View
Context-aware Recommendation: A Quick View
 
Aspect Extraction Performance With Common Pattern of Dependency Relation in ...
Aspect Extraction Performance With Common Pattern of  Dependency Relation in ...Aspect Extraction Performance With Common Pattern of  Dependency Relation in ...
Aspect Extraction Performance With Common Pattern of Dependency Relation in ...
 
Machine translation survey - vol1
Machine translation survey  - vol1Machine translation survey  - vol1
Machine translation survey - vol1
 
Machine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 SydneyMachine Learning for Recommender Systems MLSS 2015 Sydney
Machine Learning for Recommender Systems MLSS 2015 Sydney
 
IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)IE: Named Entity Recognition (NER)
IE: Named Entity Recognition (NER)
 
Probabilistic approach to reliable localization
Probabilistic approach to reliable localizationProbabilistic approach to reliable localization
Probabilistic approach to reliable localization
 
Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]Dimension reduction techniques[Feature Selection]
Dimension reduction techniques[Feature Selection]
 
Absa project
Absa projectAbsa project
Absa project
 
A short history of MCMC
A short history of MCMCA short history of MCMC
A short history of MCMC
 
บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)
บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)
บทที่ 6 การไหลในทางน้ำเปิด Open Channel Flow + คลิป (Fluid Mechanics)
 
El analisis de sentimientos
El analisis de sentimientosEl analisis de sentimientos
El analisis de sentimientos
 
บทที่ 2 การเคลื่อนที่แนวตรง
บทที่ 2 การเคลื่อนที่แนวตรงบทที่ 2 การเคลื่อนที่แนวตรง
บทที่ 2 การเคลื่อนที่แนวตรง
 
ของไหล 2
ของไหล 2ของไหล 2
ของไหล 2
 
รวมสูตรฟิสิกส์ ม.6
รวมสูตรฟิสิกส์ ม.6รวมสูตรฟิสิกส์ ม.6
รวมสูตรฟิสิกส์ ม.6
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
Covariance Matrix Adaptation Evolution Strategy (CMA-ES)
 

Similar to Expectation propagation

ガウス過程入門
ガウス過程入門ガウス過程入門
ガウス過程入門ShoShimoyama
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative ModelsKenta Oono
 
Gaussian processing
Gaussian processingGaussian processing
Gaussian processing홍배 김
 
Kernel estimation(ref)
Kernel estimation(ref)Kernel estimation(ref)
Kernel estimation(ref)Zahra Amini
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheetJoachim Gwoke
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsChristian Robert
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking componentsChristian Robert
 
A nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaA nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaAlexander Litvinenko
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacesbutest
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big DataChristian Robert
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber SecurityAltoros
 
proposal_pura
proposal_puraproposal_pura
proposal_puraErick Lin
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep LearningRayKim51
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1arogozhnikov
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the WeightsMark Chang
 

Similar to Expectation propagation (20)

ガウス過程入門
ガウス過程入門ガウス過程入門
ガウス過程入門
 
talk MCMC & SMC 2004
talk MCMC & SMC 2004talk MCMC & SMC 2004
talk MCMC & SMC 2004
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
VAE-type Deep Generative Models
VAE-type Deep Generative ModelsVAE-type Deep Generative Models
VAE-type Deep Generative Models
 
Gaussian processing
Gaussian processingGaussian processing
Gaussian processing
 
Kernel estimation(ref)
Kernel estimation(ref)Kernel estimation(ref)
Kernel estimation(ref)
 
Probability cheatsheet
Probability cheatsheetProbability cheatsheet
Probability cheatsheet
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
Testing for mixtures by seeking components
Testing for mixtures by seeking componentsTesting for mixtures by seeking components
Testing for mixtures by seeking components
 
A nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formulaA nonlinear approximation of the Bayesian Update formula
A nonlinear approximation of the Bayesian Update formula
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspacespptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
 
PhysicsSIG2008-01-Seneviratne
PhysicsSIG2008-01-SeneviratnePhysicsSIG2008-01-Seneviratne
PhysicsSIG2008-01-Seneviratne
 
Unbiased Bayes for Big Data
Unbiased Bayes for Big DataUnbiased Bayes for Big Data
Unbiased Bayes for Big Data
 
Deep Learning for Cyber Security
Deep Learning for Cyber SecurityDeep Learning for Cyber Security
Deep Learning for Cyber Security
 
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
QMC Program: Trends and Advances in Monte Carlo Sampling Algorithms Workshop,...
 
proposal_pura
proposal_puraproposal_pura
proposal_pura
 
Bayesian Deep Learning
Bayesian Deep LearningBayesian Deep Learning
Bayesian Deep Learning
 
MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1MLHEP 2015: Introductory Lecture #1
MLHEP 2015: Introductory Lecture #1
 
Information in the Weights
Information in the WeightsInformation in the Weights
Information in the Weights
 

More from Dong Guo

Convex optimization methods
Convex optimization methodsConvex optimization methods
Convex optimization methodsDong Guo
 
AlphaGo zero
AlphaGo zeroAlphaGo zero
AlphaGo zeroDong Guo
 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)Dong Guo
 
机器学习概述
机器学习概述机器学习概述
机器学习概述Dong Guo
 
Additive model and boosting tree
Additive model and boosting treeAdditive model and boosting tree
Additive model and boosting treeDong Guo
 
Feature selection
Feature selectionFeature selection
Feature selectionDong Guo
 
Logistic Regression
Logistic RegressionLogistic Regression
Logistic RegressionDong Guo
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning IntroductionDong Guo
 

More from Dong Guo (8)

Convex optimization methods
Convex optimization methodsConvex optimization methods
Convex optimization methods
 
AlphaGo zero
AlphaGo zeroAlphaGo zero
AlphaGo zero
 
DQN (Deep Q-Network)
DQN (Deep Q-Network)DQN (Deep Q-Network)
DQN (Deep Q-Network)
 
机器学习概述
机器学习概述机器学习概述
机器学习概述
 
Additive model and boosting tree
Additive model and boosting treeAdditive model and boosting tree
Additive model and boosting tree
 
Feature selection
Feature selectionFeature selection
Feature selection
 
Logistic Regression
Logistic RegressionLogistic Regression
Logistic Regression
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 

Recently uploaded

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Exploring the Future Potential of AI-Enabled Smartphone Processors

Expectation propagation

  • 1. Expectation Propagation: Theory and Application. Dong Guo, Research Workshop 2013, Hulu Internal. See more details in http://dongguo.me/blog/2014/01/01/expectation-propagation/ and http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
  • 4. Bayesian Paradigm. Infer the posterior distribution: Prior + Data → Posterior → Make decision. Note: the figure of LDA is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.
  • 5. Bayesian inference methods. Exact inference: belief propagation. Approximate inference: stochastic (sampling) and deterministic; the deterministic family includes assumed density filtering, expectation propagation, and variational Bayes.
  • 6. Message passing. A form of communication used in multiple domains of computer science: parallel computing (MPI), object-oriented programming, inter-process communication, and Bayesian inference, where it names a family of methods for inferring posterior distributions.
  • 7. Expectation Propagation. Belongs to the message passing family; an approximate method (iteration is needed); very popular in Bayesian inference, especially in graphical models.
  • 8. Researchers. Thomas Minka: EP was proposed in his PhD thesis. Kevin P. Murphy: author of Machine Learning: A Probabilistic Perspective.
  • 10. Background: (truncated) Gaussian, exponential family, graphical models, factor graphs, belief propagation, moment matching.
  • 11. Gaussian and Truncated Gaussian. Gaussian operations are the basis for EP inference: addition, multiplication, and division of Gaussians, and Gaussian integrals. The truncated Gaussian is used in many EP applications. See details here.
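A minimal sketch of the Gaussian operations the slide refers to: written in natural-parameter form (precision 1/σ² and precision-mean μ/σ²), multiplying two Gaussian densities simply adds the parameters, and dividing subtracts them. The function names here are illustrative, not from the talk.

```python
def to_natural(mu, sigma2):
    """(mean, variance) -> (precision-mean, precision)."""
    return mu / sigma2, 1.0 / sigma2

def from_natural(rho, tau):
    """(precision-mean, precision) -> (mean, variance)."""
    return rho / tau, 1.0 / tau

def gaussian_multiply(m1, v1, m2, v2):
    """Product of two Gaussian densities is (proportional to) a
    Gaussian: natural parameters add."""
    r1, t1 = to_natural(m1, v1)
    r2, t2 = to_natural(m2, v2)
    return from_natural(r1 + r2, t1 + t2)

def gaussian_divide(m1, v1, m2, v2):
    """Ratio of two Gaussian densities (used in EP's remove-a-factor
    step): natural parameters subtract."""
    r1, t1 = to_natural(m1, v1)
    r2, t2 = to_natural(m2, v2)
    return from_natural(r1 - r2, t1 - t2)

# N(0, 1) * N(1, 1) -> N(0.5, 0.5)
m, v = gaussian_multiply(0.0, 1.0, 1.0, 1.0)
```

Division is the exact inverse of multiplication here, which is why EP can cleanly remove one approximating factor from the posterior and later put a refined version back.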
  • 12. Exponential family distribution. Very good summary on Wikipedia: q(z) = h(z) g(η) exp{η^T u(z)}. The sufficient statistics of the Gaussian distribution are (x, x²). Typical distributions are shown. Note: the 4 figures above are from Wikipedia.
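For the Gaussian case mentioned on the slide, the sufficient statistics u(x) = (x, x²) pair with natural parameters η = (μ/σ², −1/(2σ²)). A small round-trip sketch (helper names are mine):

```python
def natural_params(mu, sigma2):
    """Natural parameters eta of N(mu, sigma2) when the sufficient
    statistics are u(x) = (x, x^2)."""
    return (mu / sigma2, -0.5 / sigma2)

def mean_params(eta1, eta2):
    """Invert back to (mean, variance)."""
    sigma2 = -0.5 / eta2
    return (eta1 * sigma2, sigma2)

# round trip for N(2, 4)
e1, e2 = natural_params(2.0, 4.0)
mu, sigma2 = mean_params(e1, e2)
```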
  • 13. Graphical Models. Directed graph (Bayesian network): P(x) = ∏_{k=1}^{K} p(x_k | pa_k). Undirected graph (e.g. conditional random field).
  • 14. Factor graph. Expresses the relations between variable nodes explicitly: each relation on an edge becomes a factor node. This hides the difference between BNs and CRFs during inference and makes inference more intuitive.
  • 16. Belief Propagation Overview. An exact Bayesian method to infer marginal distributions via 'sum-product' message passing. Key components: calculating the posterior distribution of a variable node, and two kinds of messages.
  • 17. Posterior distribution of a variable node. In a (tree-structured) factor graph the joint factorizes around any variable x as p(X) = ∏_{s∈ne(x)} F_s(x, X_s), so p(x) = Σ_{X\x} p(X) = ∏_{s∈ne(x)} Σ_{X_s} F_s(x, X_s) = ∏_{s∈ne(x)} μ_{f_s→x}(x), in which μ_{f_s→x}(x) = Σ_{X_s} F_s(x, X_s). Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  • 18. Message: factor → variable node. μ_{f_s→x}(x) = Σ_{x_1} … Σ_{x_M} f_s(x, x_1, …, x_M) ∏_{m∈ne(f_s)\x} μ_{x_m→f_s}(x_m), in which {x_1, …, x_M} is the set of variables on which the factor f_s depends. Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  • 19. Message: variable → factor node. μ_{x_m→f_s}(x_m) = ∏_{l∈ne(x_m)\f_s} μ_{f_l→x_m}(x_m). Summary: the posterior distribution is determined only by the factors! Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
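The two message rules above can be demonstrated end to end on a toy factor graph. A minimal sketch with two binary variables, g1(x1) — x1 — f(x1, x2) — x2 (the graph and the numbers are made up for illustration), comparing the message-passing marginal against brute-force summation:

```python
# unary factor on x1 and pairwise factor between x1 and x2
g1 = {0: 0.3, 1: 0.7}
f = {(0, 0): 0.9, (0, 1): 0.1,
     (1, 0): 0.2, (1, 1): 0.8}

# variable -> factor: product of the other incoming factor messages
# (x1 has only g1 besides f, so the message is just g1)
msg_x1_to_f = {x1: g1[x1] for x1 in (0, 1)}

# factor -> variable: sum out x1, weighted by the incoming message
msg_f_to_x2 = {x2: sum(f[(x1, x2)] * msg_x1_to_f[x1] for x1 in (0, 1))
               for x2 in (0, 1)}

# marginal of x2 = product of all incoming messages, normalized
z = sum(msg_f_to_x2.values())
marginal_x2 = {x2: msg_f_to_x2[x2] / z for x2 in (0, 1)}

# brute-force marginalization for comparison
brute = {x2: sum(g1[x1] * f[(x1, x2)] for x1 in (0, 1)) for x2 in (0, 1)}
zb = sum(brute.values())
brute = {k: val / zb for k, val in brute.items()}
# marginal_x2 == brute == {0: 0.41, 1: 0.59}
```

On a tree, this local computation reproduces the exact marginal, which is the content of slide 17.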
  • 20. Whole steps of BP. Steps to calculate the posterior distribution of a given variable node: Step 1: construct the factor graph. Step 2: treat the variable node as root, and initialize the messages sent from the leaf nodes. Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors. Step 4: get the marginal distribution by multiplying all incoming messages. Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
  • 21. BP: example. Infer the marginal distribution of x_3; infer the marginal distributions of all variables. Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
  • 22. The posterior is sometimes intractable. Example: infer the mean of a Gaussian distribution under the clutter model, p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI), with prior p(θ) = N(θ | 0, bI). Another example: the ad predictor. Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  • 23. Distribution Approximation. Approximate p(x) with q(x), which belongs to the exponential family: q(x) = h(x) g(η) exp{η^T u(x)}. Then KL(p || q) = −∫ p(x) ln (q(x)/p(x)) dx = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx = −ln g(η) − η^T E_{p(x)}[u(x)] + const, where the const terms are independent of the natural parameter η. Minimize KL(p || q) by setting the gradient with respect to η to zero: −∇ln g(η) = E_{p(x)}[u(x)]. By leveraging formula (2.226) in PRML: E_{q(x)}[u(x)] = −∇ln g(η) = E_{p(x)}[u(x)].
  • 24. Moment matching. It is called moment matching when q(x) is a Gaussian distribution: then u(x) = (x, x²)^T, so ∫ q(x) x dx = ∫ p(x) x dx and ∫ q(x) x² dx = ∫ p(x) x² dx, hence the mean and variance of q(x) equal those of p(x). The k'th moment of a distribution: M_k = ∫_a^b x^k f(x) dx.
  • 25. EXPECTATION PROPAGATION = Belief Propagation + Moment matching?
  • 26. Key Idea. Approximate each factor with a Gaussian distribution. Approximate corresponding factor pairs one by one? Approximate each factor in turn in the context of all the remaining factors (proposed by Minka): refine factor f̃_j(θ) by ensuring that q^new(θ) ∝ f̃_j(θ) q^{\j}(θ) is close to f_j(θ) q^{\j}(θ), in which q^{\j}(θ) = q(θ) / f̃_j(θ).
  • 27. EP: the detailed steps. 1. Initialize all of the approximating factors f̃_i(θ). 2. Initialize the posterior approximation by setting q(θ) ∝ ∏_i f̃_i(θ). 3. Until convergence: (a) choose a factor f̃_j(θ) to refine; (b) remove f̃_j(θ) from the posterior by division: q^{\j}(θ) = q(θ) / f̃_j(θ); (c) get the new posterior by setting the sufficient statistics of q^new(θ) equal to those of f_j(θ) q^{\j}(θ) / z_j (i.e. minimize KL(f_j(θ) q^{\j}(θ) / z_j || q^new(θ))), in which z_j = ∫ f_j(θ) q^{\j}(θ) dθ; (d) get the refined factor: f̃_j(θ) = z_j q^new(θ) / q^{\j}(θ).
  • 28. Example: the clutter problem. Infer the mean of a Gaussian distribution; we might try MLE, but the mixture likelihood makes that awkward: p(x | θ) = (1 − w) N(x | θ, I) + w N(x | 0, aI), with prior p(θ) = N(θ | 0, bI). Approximate with q(θ) = N(θ | m, vI) and each factor f̃_n(θ) = N(θ | m_n, v_n I), i.e. approximate a mixture of Gaussians using a Gaussian. Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  • 29. Example: the clutter problem (2). Approximate a complex factor (e.g. a mixture of Gaussians) with a Gaussian: f_n(θ) in blue, f̃_n(θ) in red, and q^{\n}(θ) in green. Remember that the variance of q^{\n}(θ) is usually very small, so f̃_n(θ) only needs to approximate f_n(θ) in a small range. Note: the 2 figures above are from the book 'Pattern Recognition and Machine Learning'.
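The EP steps of slide 27 applied to the clutter problem can be sketched in the 1-D case, following the closed-form moment updates in PRML §10.7.2. This is an illustrative sketch, not the talk's code: the sites are stored unnormalized in natural form (the scale z_j is dropped since only the posterior shape is needed), and the hyperparameters w, a, b and the data are made up.

```python
import math

def npdf(x, m, v):
    """Density of N(x | m, v)."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def ep_clutter(data, w=0.5, a=10.0, b=100.0, iters=20):
    """EP for the 1-D clutter problem: likelihood factors
    f_n(t) = (1-w) N(x_n | t, 1) + w N(x_n | 0, a), prior N(t | 0, b).
    Each f_n gets a Gaussian site in natural form (tau = 1/v, rho = m/v)."""
    n = len(data)
    site_tau = [0.0] * n              # site precisions (start flat)
    site_rho = [0.0] * n              # site precision-means
    post_tau, post_rho = 1.0 / b, 0.0 # posterior starts as the prior
    for _ in range(iters):
        for i, x in enumerate(data):
            # (b) remove site i -> cavity distribution
            cav_tau = post_tau - site_tau[i]
            cav_rho = post_rho - site_rho[i]
            if cav_tau <= 0:          # skip numerically unstable updates
                continue
            cv, cm = 1.0 / cav_tau, cav_rho / cav_tau
            # (c) moments of f_i * cavity (closed form for the clutter model)
            z = (1 - w) * npdf(x, cm, cv + 1.0) + w * npdf(x, 0.0, a)
            r = 1.0 - w * npdf(x, 0.0, a) / z    # responsibility of the signal
            new_m = cm + r * cv * (x - cm) / (cv + 1.0)
            new_v = cv - r * cv ** 2 / (cv + 1.0) \
                    + r * (1 - r) * cv ** 2 * (x - cm) ** 2 / (cv + 1.0) ** 2
            post_tau, post_rho = 1.0 / new_v, new_m / new_v
            # (d) refined site = new posterior / cavity (natural params subtract)
            site_tau[i] = post_tau - cav_tau
            site_rho[i] = post_rho - cav_rho
    return post_rho / post_tau, 1.0 / post_tau   # posterior mean, variance

# data clustered near theta = 2
mean, var = ep_clutter([1.8, 2.0, 2.2, 2.1])
```

The loop is exactly steps (a)-(d) above; the only problem-specific piece is the closed-form moment computation inside (c).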
  • 30. Application: Bayesian CTR predictor for Bing. See the details here: inference step by step; making predictions. Some insights: the variance of each feature increases after every exposure; a sample with more features will have a bigger variance; features are assumed independent.
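A sketch of the online Bayesian-probit update in the style of the Bing ad predictor, where each feature weight carries a Gaussian belief N(μ_i, σ_i²) and one labeled impression updates the active features in closed form. The function names, β default, and data are illustrative; for the exact model see the Web-Scale Bayesian CTR paper.

```python
import math

def pdf(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2)))

def v_fn(t):
    """First truncated-Gaussian correction function N(t)/Phi(t)."""
    return pdf(t) / cdf(t)

def w_fn(t):
    """Second correction function v(t) * (v(t) + t)."""
    return v_fn(t) * (v_fn(t) + t)

def update(means, variances, active, y, beta=1.0):
    """One online update for label y in {+1, -1} (click / no click):
    shift the means of the active features toward the observed outcome
    and shrink their variances."""
    total_mean = sum(means[i] for i in active)
    total_var = beta ** 2 + sum(variances[i] for i in active)
    s = math.sqrt(total_var)
    t = y * total_mean / s
    for i in active:
        means[i] += y * (variances[i] / s) * v_fn(t)
        variances[i] *= 1.0 - (variances[i] / total_var) * w_fn(t)

# two features with prior N(0, 1) each; observe one click
means, variances = [0.0, 0.0], [1.0, 1.0]
update(means, variances, active=[0, 1], y=+1)
```

In this sketch each update shifts the active means toward the observed label and shrinks their variances, so the model grows more confident about frequently seen features.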
  • 31. Experimentation. The dataset is very inhomogeneous. Performance (AUC): FTRL 0.638, OWLQN 0.641, Ad predictor 0.639. Other metrics: pros are speed, low parameter-tuning cost, online learning support, interpretability, and support for adding more factors; the con is sparsity. Code.
  • 32. Application: XBOX skill rating system. See details on pages 793-798 of Machine Learning: A Probabilistic Perspective. Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
  • 33. Apply to all Bayesian models. Infer.NET (Microsoft/Bishop): a framework for running Bayesian inference in graphical models; model-based machine learning.
  • 34. References. Books: Chapters 2/8/10 of Pattern Recognition and Machine Learning; Chapter 22 of Machine Learning: A Probabilistic Perspective. Papers: A Family of Algorithms for Approximate Bayesian Inference; From Belief Propagation to Expectation Propagation; TrueSkill: A Bayesian Skill Rating System; Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine; Roadmap for EP.