Expectation Propagation
Theory and Application

Dong Guo
Research Workshop 2013, Hulu Internal
  
	
  
See more details in
http://dongguo.me/blog/2014/01/01/expectation-propagation/
http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
  
	
  
	
  
Outline
•  Overview
•  Background
•  Theory
•  Applications
  
OVERVIEW	
  
Bayesian Paradigm
•  Infer the posterior distribution (Bayes' rule is restated below)

[Figure: prior + data → posterior → make decision]
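As a reminder of what "infer the posterior distribution" means here (standard Bayes' rule, stated for completeness rather than taken from the slides):

$$p(\theta \mid \mathcal D) = \frac{p(\mathcal D \mid \theta)\,p(\theta)}{p(\mathcal D)} \propto p(\mathcal D \mid \theta)\,p(\theta),$$

and decisions are then made by averaging predictions (or losses) over this posterior rather than by plugging in a single point estimate.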
  

Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.
  	
  
Bayesian inference methods
•  Exact inference
   –  Belief propagation
•  Approximate inference
   –  Stochastic (sampling)
   –  Deterministic
      •  Assumed density filtering
      •  Expectation propagation
      •  Variational Bayes
  
Message passing
•  A form of communication used in multiple domains of computer science
   –  Parallel computing (MPI)
   –  Object-oriented programming
   –  Inter-process communication
   –  Bayesian inference
•  A family of methods to infer posterior distributions
  
Expectation Propagation
•  Belongs to the message passing family
•  Approximate method (iteration is needed)
•  Very popular in Bayesian inference, especially for graphical models
  
Researchers
•  Thomas Minka
   –  EP was proposed in his PhD thesis
•  Kevin P. Murphy
   –  Machine Learning: A Probabilistic Perspective
  
BACKGROUND	
  
Background
•  (Truncated) Gaussian
•  Exponential family
•  Graphical model
•  Factor graph
•  Belief propagation
•  Moment matching
  
Gaussian and Truncated Gaussian
•  Gaussian operations are the basis for EP inference
   –  Gaussian +, ×, ÷ Gaussian
   –  Gaussian integral
•  The truncated Gaussian is used in many EP applications (a sketch of these operations follows)
•  See details here
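The Gaussian operations listed above can be written directly in terms of natural parameters (precision and precision-adjusted mean), and the truncated-Gaussian moments only need the standard normal pdf/cdf. A minimal 1-D sketch in Python (the function names and the 1-D restriction are my own choices, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def gaussian_multiply(m1, v1, m2, v2):
    """Product of two 1-D Gaussian densities, up to normalization.
    Work in natural parameters: precision tau = 1/v, precision-mean tau*m."""
    tau = 1.0 / v1 + 1.0 / v2
    tau_m = m1 / v1 + m2 / v2
    return tau_m / tau, 1.0 / tau          # mean, variance

def gaussian_divide(m1, v1, m2, v2):
    """Quotient N(m1, v1) / N(m2, v2); precisions subtract instead of add."""
    tau = 1.0 / v1 - 1.0 / v2
    tau_m = m1 / v1 - m2 / v2
    return tau_m / tau, 1.0 / tau

def truncated_gaussian_moments(m, v, a):
    """Mean and variance of N(m, v) truncated to the region x > a."""
    s = np.sqrt(v)
    alpha = (a - m) / s
    lam = norm.pdf(alpha) / (1.0 - norm.cdf(alpha))   # hazard function
    mean = m + s * lam
    var = v * (1.0 - lam * (lam - alpha))
    return mean, var

# Example: combine two noisy Gaussian beliefs about the same quantity
print(gaussian_multiply(0.0, 1.0, 2.0, 4.0))        # pulled toward the tighter belief
print(truncated_gaussian_moments(0.0, 1.0, 0.0))    # positive half of a standard normal
```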
  
Exponential family distribution
•  Very good summary in Wikipedia

$$q(\mathbf{z}) = h(\mathbf{z})\,g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{\mathsf T} u(\mathbf{z})\}$$

•  Sufficient statistics of the Gaussian distribution: (x, x²)
•  Typical distributions

Note: the 4 figures above are from Wikipedia.
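For concreteness, the univariate Gaussian written in this exponential-family form (standard algebra, not from the slides) has sufficient statistics (x, x²) and natural parameters determined by its mean and variance:

$$\mathcal N(x \mid \mu, \sigma^2) = h(x)\,g(\boldsymbol\eta)\exp\{\boldsymbol\eta^{\mathsf T} u(x)\}, \quad u(x) = \begin{pmatrix} x \\ x^2 \end{pmatrix}, \quad \boldsymbol\eta = \begin{pmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{pmatrix}, \quad h(x) = \frac{1}{\sqrt{2\pi}}, \quad g(\boldsymbol\eta) = \sqrt{-2\eta_2}\,\exp\!\Big(\frac{\eta_1^2}{4\eta_2}\Big).$$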
  
Graphical Models
•  Directed graph (Bayesian Network)

[Figure: directed graph over x1, x2, x3, x4]

$$P(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)$$
  

•  Undirected graph (Conditional Random Field); an example factorization of both kinds follows

[Figure: undirected graph over x1, x2, x3, x4]
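As an illustration of the product-over-parents formula (the chain structure here is hypothetical, since the original figures are not reproduced):

$$P(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2 \mid x_1)\,p(x_3 \mid x_2)\,p(x_4 \mid x_3) \quad \text{for the directed chain } x_1 \to x_2 \to x_3 \to x_4,$$
$$P(\mathbf{x}) = \frac{1}{Z}\prod_{C} \psi_C(\mathbf{x}_C) \quad \text{for an undirected model, with potentials } \psi_C \text{ over cliques } C.$$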
  
Factor graph
•  Expresses the relations between variable nodes explicitly
•  The relation on an edge becomes a factor node
•  Hides the difference between BN and CRF during inference
•  Makes inference more intuitive

[Figure: a graph over x1, x2, x3, x4 and the corresponding factor graph with factor nodes fa, fc, ...]
  
BELIEF PROPAGATION
  
Belief Propagation Overview
•  Exact Bayesian method to infer marginal distributions (exact on tree-structured graphs)
   –  'sum-product' message passing
•  Key components
   –  Calculate the posterior distribution of a variable node
   –  Two kinds of messages
  
Posterior distribution of a variable node
•  Factor graph

$$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s), \quad \text{for any variable } x \text{ in the graph}$$

$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)} \Big[ \sum_{X_s} F_s(x, X_s) \Big] = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x),$$

$$\text{in which } \mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s).$$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
Message: factor -> variable node
•  Factor graph

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{x_m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m),$$
$$\text{in which } \{x_1, \ldots, x_M\} \text{ is the set of variables on which the factor } f_s \text{ depends.}$$

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Message: variable -> factor node
•  Factor graph

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

Summary: the posterior distribution is determined only by the factors!

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Whole steps of BP
•  Steps to calculate the posterior distribution of a given variable node
   –  Step 1: construct the factor graph
   –  Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
   –  Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors
   –  Step 4: get the marginal distribution by multiplying all incoming messages (a small numerical sketch follows the note below)

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
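To make the four steps concrete, here is a minimal sum-product sketch on a hypothetical three-variable chain x1 - fa - x2 - fb - x3 with binary variables (the chain, factor tables, and names are illustrative, not from the slides); it computes the marginal of x1 and checks it against brute-force summation:

```python
import numpy as np

# Hypothetical pairwise factors on a chain x1 - fa - x2 - fb - x3 (binary variables)
fa = np.array([[1.0, 0.5],
               [0.5, 2.0]])          # fa[x1, x2]
fb = np.array([[1.5, 0.2],
               [0.3, 1.0]])          # fb[x2, x3]

# Step 2: x1 is the root; leaf variable x3 sends a constant message toward fb
msg_x3_to_fb = np.ones(2)

# Step 3: factor -> variable messages sum out all variables except the recipient
msg_fb_to_x2 = fb @ msg_x3_to_fb     # mu_{fb->x2}(x2) = sum_{x3} fb(x2, x3)
msg_x2_to_fa = msg_fb_to_x2          # x2 has no other neighboring factors
msg_fa_to_x1 = fa @ msg_x2_to_fa     # mu_{fa->x1}(x1) = sum_{x2} fa(x1, x2) mu_{x2->fa}(x2)

# Step 4: multiply all incoming messages at the root and normalize
p_x1 = msg_fa_to_x1 / msg_fa_to_x1.sum()

# Brute-force check: p(x1) is proportional to sum_{x2, x3} fa(x1, x2) fb(x2, x3)
joint = np.einsum('ij,jk->ijk', fa, fb)
p_x1_brute = joint.sum(axis=(1, 2))
p_x1_brute /= p_x1_brute.sum()

print(p_x1, p_x1_brute)              # the two marginals agree
```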
  
BP: example
•  Infer the marginal distribution of x_3
•  Infer the marginal distributions of all variables

Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
  
Posterior is intractable sometimes
•  Example
   –  Infer the mean of a Gaussian distribution (the clutter problem)

$$p(x \mid \theta) = (1 - w)\,\mathcal N(x \mid \theta, I) + w\,\mathcal N(x \mid 0, aI)$$
$$p(\theta) = \mathcal N(\theta \mid 0, bI)$$

   –  Ad predictor

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Distribution Approximation

Approximate $p(x)$ with $q(x)$, which belongs to the exponential family, so that $q(x) = h(x)\,g(\eta)\exp\{\eta^{\mathsf T} u(x)\}$:

$$\mathrm{KL}(p \,\|\, q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx$$
$$= -\int p(x)\ln g(\eta)\,dx - \int p(x)\,\eta^{\mathsf T} u(x)\,dx + \text{const} = -\ln g(\eta) - \eta^{\mathsf T}\,\mathbb E_{p(x)}[u(x)] + \text{const},$$

where the const terms are independent of the natural parameter $\eta$.

Minimize $\mathrm{KL}(p \,\|\, q)$ by setting the gradient with respect to $\eta$ to zero:
$$-\nabla \ln g(\eta) = \mathbb E_{p(x)}[u(x)].$$
By leveraging formula (2.226) in PRML:
$$\mathbb E_{q(x)}[u(x)] = -\nabla \ln g(\eta) = \mathbb E_{p(x)}[u(x)].$$
Moment matching

It is called moment matching when $q(x)$ is a Gaussian distribution; then $u(x) = (x, x^2)^{\mathsf T}$, so

$$\int q(x)\,x\,dx = \int p(x)\,x\,dx, \quad \text{and} \quad \int q(x)\,x^2\,dx = \int p(x)\,x^2\,dx$$
$$\Rightarrow \text{mean}_{q(x)} = \int q(x)\,x\,dx = \int p(x)\,x\,dx = \text{mean}_{p(x)},$$
$$\text{variance}_{q(x)} = \int q(x)\,x^2\,dx - (\text{mean}_{q(x)})^2 = \int p(x)\,x^2\,dx - (\text{mean}_{p(x)})^2 = \text{variance}_{p(x)}$$

•  Moments of a distribution

$$k\text{'th moment: } M_k = \int_a^b x^k f(x)\,dx$$
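A quick numerical illustration of moment matching (the mixture weights and parameters below are made up for the example): approximate a two-component Gaussian mixture p(x) by a single Gaussian q(x) whose first two moments match those of p.

```python
import numpy as np

# Hypothetical 1-D Gaussian mixture p(x) = sum_i w_i N(x | mu_i, var_i)
w   = np.array([0.7, 0.3])
mu  = np.array([-1.0, 2.0])
var = np.array([0.5, 1.5])

# Moment matching: q(x) = N(x | m, v) with the same mean and variance as p(x)
m = np.sum(w * mu)                          # E_p[x]
second_moment = np.sum(w * (var + mu**2))   # E_p[x^2]
v = second_moment - m**2                    # Var_p[x]
print(f"matched Gaussian: mean={m:.3f}, variance={v:.3f}")

# Monte Carlo sanity check by sampling from the mixture
rng = np.random.default_rng(0)
comp = rng.choice(len(w), size=200_000, p=w)
samples = rng.normal(mu[comp], np.sqrt(var[comp]))
print(f"sampled  mixture: mean={samples.mean():.3f}, variance={samples.var():.3f}")
```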
EXPECTATION PROPAGATION
= Belief Propagation + Moment matching?
  
Key Idea
•  Approximate each factor with a Gaussian distribution
•  Approximate corresponding factor pairs one by one?
•  Approximate each factor in turn in the context of all remaining factors (proposed by Minka)

$$\text{Refine the factor } \tilde f_j(\theta) \text{ by ensuring that } q^{\text{new}}(\theta) \propto \tilde f_j(\theta)\,q^{\setminus j}(\theta) \text{ is close to } f_j(\theta)\,q^{\setminus j}(\theta), \quad \text{in which } q^{\setminus j}(\theta) = \frac{q(\theta)}{\tilde f_j(\theta)}.$$
EP: the detailed steps

1. Initialize all of the approximating factors $\tilde f_i(\theta)$.
2. Initialize the posterior approximation by setting $q(\theta) \propto \prod_i \tilde f_i(\theta)$.
3. Until convergence:
   (a) Choose a factor $\tilde f_j(\theta)$ to refine.
   (b) Remove $\tilde f_j(\theta)$ from the posterior by division: $q^{\setminus j}(\theta) = \dfrac{q(\theta)}{\tilde f_j(\theta)}$.
   (c) Get the new posterior $q^{\text{new}}(\theta)$ by setting its sufficient statistics equal to those of $\dfrac{f_j(\theta)\,q^{\setminus j}(\theta)}{Z_j}$, i.e. minimize $\mathrm{KL}\!\Big(\dfrac{f_j(\theta)\,q^{\setminus j}(\theta)}{Z_j} \,\Big\|\, q^{\text{new}}(\theta)\Big)$, in which $Z_j = \int f_j(\theta)\,q^{\setminus j}(\theta)\,d\theta$.
   (d) Get the refined factor $\tilde f_j(\theta) = K\,\dfrac{q^{\text{new}}(\theta)}{q^{\setminus j}(\theta)}$, where the constant $K$ can be fixed by matching zeroth moments ($K = Z_j$).
Example: the clutter problem
•  Infer the mean of a Gaussian distribution
•  Want to try MLE, but:

$$p(x \mid \theta) = (1 - w)\,\mathcal N(x \mid \theta, I) + w\,\mathcal N(x \mid 0, aI)$$
$$p(\theta) = \mathcal N(\theta \mid 0, bI)$$

•  Approximate with
$$q(\theta) = \mathcal N(\theta \mid m, vI), \quad \text{and each factor } \tilde f_n(\theta) = \mathcal N(\theta \mid m_n, v_n I)$$
   –  Approximate the Gaussian mixture using a Gaussian

Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
  
Example: the clutter problem (2)
•  Approximate a complex factor (e.g. a Gaussian mixture) with a Gaussian (a numerical sketch follows the note)

$f_n(\theta)$ in blue, $\tilde f_n(\theta)$ in red, and $q^{\setminus n}(\theta)$ in green.
Remember that the variance of $q^{\setminus n}(\theta)$ is usually very small, so $\tilde f_n(\theta)$ only needs to approximate $f_n(\theta)$ over a small range.

Note: the 2 figures above are from the book 'Pattern Recognition and Machine Learning'.
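To connect the algorithm steps with this example, here is a rough 1-D EP sketch for the clutter problem; instead of the analytic updates from PRML it performs the moment matching of step (c) by numerical integration on a grid, and the data, hyperparameters (w, a, b), and function names are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
w, a, b = 0.3, 10.0, 100.0                 # clutter weight, clutter variance, prior variance
theta_true, n = 2.0, 20
clutter = rng.random(n) < w
x = np.where(clutter, rng.normal(0.0, np.sqrt(a), n), rng.normal(theta_true, 1.0, n))

grid = np.linspace(-15, 15, 4001)          # grid for numerical moment matching
dx = grid[1] - grid[0]

def likelihood_factor(xn, theta):
    """One clutter-problem factor f_n(theta) = (1-w) N(x_n|theta,1) + w N(x_n|0,a)."""
    return (1 - w) * norm.pdf(xn, loc=theta, scale=1.0) + w * norm.pdf(xn, loc=0.0, scale=np.sqrt(a))

# Steps 1-2: approximating factors are Gaussians in natural-parameter form (tau, tau*m);
# initialize them to be flat (tau = 0), so q(theta) starts equal to the prior N(0, b).
tau_n, taum_n = np.zeros(n), np.zeros(n)
tau_q, taum_q = 1.0 / b, 0.0               # the prior factor is already Gaussian and never refined

for sweep in range(10):                    # step 3: iterate until (approximately) converged
    for j in range(n):
        # (b) remove factor j by dividing it out of q (subtract natural parameters)
        tau_cav, taum_cav = tau_q - tau_n[j], taum_q - taum_n[j]
        if tau_cav <= 0:                   # EP can produce an invalid cavity; skip such updates
            continue
        m_cav, v_cav = taum_cav / tau_cav, 1.0 / tau_cav
        # (c) moment-match q_new to f_j(theta) * cavity(theta) by numerical integration
        tilted = likelihood_factor(x[j], grid) * norm.pdf(grid, m_cav, np.sqrt(v_cav))
        Z = tilted.sum() * dx
        m_new = (grid * tilted).sum() * dx / Z
        v_new = (grid**2 * tilted).sum() * dx / Z - m_new**2
        tau_q, taum_q = 1.0 / v_new, m_new / v_new
        # (d) the refined factor is q_new / cavity (natural parameters subtract)
        tau_n[j], taum_n[j] = tau_q - tau_cav, taum_q - taum_cav

print(f"EP posterior over theta: mean={taum_q / tau_q:.3f}, variance={1.0 / tau_q:.3f}")
```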
  
Application: Bayesian CTR predictor for Bing
•  See the details here
   –  Inference step by step
   –  Making predictions
•  Some insights
   –  Variance of each feature increases after every exposure
   –  A sample with more features will have a bigger variance
•  Independence assumption for the features (a rough update sketch follows)
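As a companion to the linked write-up, here is a rough sketch of the kind of per-impression Gaussian weight update used in adPredictor-style models (probit link, one Gaussian per binary feature). The exact formulas and constants should be checked against the Bing CTR paper; the value of β, the feature encoding, and all names here are assumptions.

```python
import numpy as np
from scipy.stats import norm

def v(t):
    """Mean-correction function from truncated-Gaussian moment matching."""
    return norm.pdf(t) / norm.cdf(t)

def w(t):
    """Variance-correction function."""
    return v(t) * (v(t) + t)

def update(mu, sigma2, active, y, beta=1.0):
    """One adPredictor-style online update (sketch).

    mu, sigma2 : per-feature Gaussian weight means / variances (arrays)
    active     : indices of the binary features present in this impression
    y          : +1 for click, -1 for no click
    beta       : assumed noise scale of the probit likelihood
    """
    s = mu[active].sum()                         # mean of the total score
    var = beta**2 + sigma2[active].sum()         # variance of the (noisy) total score
    std = np.sqrt(var)
    t = y * s / std
    mu[active] += y * (sigma2[active] / std) * v(t)
    sigma2[active] *= 1.0 - (sigma2[active] / var) * w(t)
    return mu, sigma2

# Toy usage: 5 features, prior N(0, 1) on each weight
mu, sigma2 = np.zeros(5), np.ones(5)
mu, sigma2 = update(mu, sigma2, active=np.array([0, 2]), y=+1)
print(mu, sigma2)   # in this simplified update the active means move toward the click and their variances shrink
```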
  
Experimentation
•  The dataset is very inhomogeneous

•  Performance

   Model          AUC
   FTRL           0.638
   OWLQN          0.641
   Ad predictor   0.639

   –  Other metrics
•  Pros: speed, low parameter-tuning cost, online-learning support, interpretability, easy to add more factors
•  Cons: sparsity
•  Code
  
Application: XBOX skill rating system
•  See details in pp. 793-798 of Machine Learning: A Probabilistic Perspective

Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
  	
  
Apply to all Bayesian models
•  Infer.NET (Microsoft / Bishop)
   –  A framework for running Bayesian inference in graphical models
   –  Model-based machine learning
  	
  
References
•  Books
   –  Chapters 2, 8, and 10 of Pattern Recognition and Machine Learning
   –  Chapter 22 of Machine Learning: A Probabilistic Perspective
•  Papers
   –  A Family of Algorithms for Approximate Bayesian Inference
   –  From Belief Propagation to Expectation Propagation
   –  TrueSkill: A Bayesian Skill Rating System
   –  Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine
•  Roadmap for EP
  
