Expectation Propagation

A deck for a Hulu internal machine learning workshop, introducing the background, theory, and application of the expectation propagation method.

  1. Expectation Propagation: Theory and Application
     Dong Guo, Research Workshop 2013, Hulu Internal
     See more details in:
     http://dongguo.me/blog/2014/01/01/expectation-propagation/
     http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
  2. Outline
     • Overview
     • Background
     • Theory
     • Applications
  3. OVERVIEW
  4. Bayesian Paradigm
     • Infer the posterior distribution: Prior + Data -> Posterior -> Make decision
     Note: the LDA figure is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian
     Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.
  5. Bayesian inference methods
     • Exact inference
       – Belief propagation
     • Approximate inference
       – Stochastic (sampling)
       – Deterministic
         • Assumed density filtering
         • Expectation propagation
         • Variational Bayes
  6. Message passing
     • A form of communication used in multiple domains of computer science
       – Parallel computing (MPI)
       – Object-oriented programming
       – Inter-process communication
       – Bayesian inference
     • A family of methods to infer the posterior distribution
  7. Expectation Propagation
     • Belongs to the message passing family
     • Approximate method (iteration is needed)
     • Very popular in Bayesian inference, especially in graphical models
  8. Researchers
     • Thomas Minka
       – EP was proposed in his PhD thesis
     • Kevin P. Murphy
       – Machine Learning: A Probabilistic Perspective
  9. BACKGROUND
 10. Background
     • (Truncated) Gaussian
     • Exponential family
     • Graphical model
     • Factor graph
     • Belief propagation
     • Moment matching
 11. Gaussian and Truncated Gaussian
     • Gaussian operations are the basis of EP inference
       – Gaussian + * / Gaussian (sums, products, quotients)
       – Gaussian integrals
     • The truncated Gaussian is used in many EP applications (see the sketch below)
     • See details here
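The deck's "see details here" link is not preserved in this transcript. As a minimal sketch of the truncated-Gaussian moments that EP applications lean on (my own illustration, assuming SciPy is available), the following computes the mean and variance of a lower-truncated Gaussian two ways:

```python
# A minimal sketch (my own illustration, not code from the deck): moments of
# a lower-truncated Gaussian, computed with scipy and with the closed form.
import numpy as np
from scipy.stats import norm, truncnorm

mu, sigma, a = 0.0, 1.0, 0.5        # truncate N(mu, sigma^2) to x > a

# 1) scipy's built-in truncated normal (bounds are given in standard units)
alpha = (a - mu) / sigma
tn = truncnorm(alpha, np.inf, loc=mu, scale=sigma)
print(tn.mean(), tn.var())

# 2) Closed form via the hazard ratio lam = phi(alpha) / (1 - Phi(alpha))
lam = norm.pdf(alpha) / norm.sf(alpha)
mean = mu + sigma * lam                          # truncated mean
var = sigma**2 * (1.0 - lam * (lam - alpha))     # truncated variance
print(mean, var)
```

The same phi/Phi correction functions reappear in the adPredictor and TrueSkill updates later in the deck.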
 12. Exponential family distribution
     • Very good summary on Wikipedia

       q(z) = h(z)\, g(\eta)\, \exp\{\eta^T u(z)\}

     • Sufficient statistics of the Gaussian distribution: (x, x^2)
     • Typical distributions
     Note: the four figures above are from Wikipedia.
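As a small hedged sketch (my own, not from the deck): for a Gaussian the sufficient statistics are u(x) = (x, x^2) and the natural parameters are eta = (mu/sigma^2, -1/(2 sigma^2)). Multiplying two Gaussian densities simply adds natural parameters, which is why Gaussian products and quotients are cheap and sit at the heart of EP:

```python
# The Gaussian as an exponential family member (my own illustration).
import numpy as np

def gaussian_to_natural(mean, var):
    """Natural parameters of N(mean, var): eta = (mean/var, -1/(2*var))."""
    return np.array([mean / var, -0.5 / var])

def natural_to_gaussian(eta):
    """Invert: var = -1/(2*eta[1]), mean = eta[0] * var."""
    var = -0.5 / eta[1]
    return eta[0] * var, var

# Multiplying two Gaussian densities adds their natural parameters.
eta = gaussian_to_natural(1.0, 2.0) + gaussian_to_natural(-1.0, 4.0)
print(natural_to_gaussian(eta))   # mean 1/3, variance 4/3 (unnormalized product)
```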
 13. Graphical Models
     • Directed graph (Bayesian Network)

       P(x) = \prod_{k=1}^{K} p(x_k \mid pa_k)

     • Undirected graph (Conditional Random Field)
     (Figure: example directed and undirected graphs over x1, x2, x3, x4)
 14. Factor graph
     • Expresses the relations between variable nodes explicitly
     • A relation on an edge becomes a factor node
     • Hides the difference between BN and CRF during inference
     • Makes inference more intuitive
     (Figure: the same graph over x1..x4 redrawn with explicit factor nodes)
 15. BELIEF PROPAGATION
 16. Belief Propagation Overview
     • Exact Bayesian method to infer marginal distributions
       – 'sum-product' message passing
     • Key components
       – Calculate the posterior distribution of a variable node
       – Two kinds of messages
 17. Posterior distribution of a variable node
     • Factor graph

       p(X) = \prod_{s \in ne(x)} F_s(x, X_s), for any variable x in the graph

       p(x) = \sum_{X \setminus x} p(X)
            = \sum_{X \setminus x} \prod_{s \in ne(x)} F_s(x, X_s)
            = \prod_{s \in ne(x)} \sum_{X_s} F_s(x, X_s)
            = \prod_{s \in ne(x)} \mu_{f_s \to x}(x)

       in which \mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s)

     Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
 18. Message: factor -> variable node
     • Factor graph

       \mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M)
                            \prod_{m \in ne(f_s) \setminus x} \mu_{x_m \to f_s}(x_m),

       in which \{x_1, \ldots, x_M\} is the set of variables on which the factor f_s depends.

     Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
 19. Message: variable -> factor node
     • Factor graph

       \mu_{x_m \to f_s}(x_m) = \prod_{l \in ne(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)

     Summary: the posterior distribution is determined only by the factors!
     Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
 20. Whole steps of BP
     • Steps to calculate the posterior distribution of a given variable node (see the sketch below)
       – Step 1: construct the factor graph
       – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
       – Step 3: apply the message passing steps recursively until the root node receives messages from
         all of its neighbors
       – Step 4: get the marginal distribution by multiplying all the messages sent in
     Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
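To make steps 1 to 4 concrete, here is a toy sum-product run on a three-variable chain (my own sketch; the factor tables are invented numbers, not from the deck), checked against brute-force enumeration:

```python
# Sum-product on a binary chain x1 - f_a - x2 - f_b - x3 (my own illustration).
import numpy as np

fa = np.array([[1.0, 0.5], [0.3, 1.0]])   # factor f_a(x1, x2)
fb = np.array([[1.0, 0.2], [0.2, 1.0]])   # factor f_b(x2, x3)

# Step 2: the leaf variable x1 sends the constant message 1 (empty product).
m_x1_fa = np.ones(2)
# Step 3, factor -> variable: sum out everything except the target variable.
m_fa_x2 = fa.T @ m_x1_fa                   # mu_{f_a -> x2}(x2)
# Step 3, variable -> factor: product of incoming messages from other factors.
m_x2_fb = m_fa_x2                          # x2's only other neighbor is f_a
m_fb_x3 = fb.T @ m_x2_fb                   # mu_{f_b -> x3}(x3)

# Step 4: the root's marginal is the (normalized) product of incoming messages.
p_x3 = m_fb_x3 / m_fb_x3.sum()

# Brute-force check over all 2^3 joint configurations.
joint = np.einsum('ij,jk->ijk', fa, fb)
print(p_x3, joint.sum(axis=(0, 1)) / joint.sum())   # the two agree
```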
 21. BP: example
     • Infer the marginal distribution of x_3
     • Infer the marginal distribution of every variable
     Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
 22. Posterior is sometimes intractable
     • Example
       – Infer the mean of a Gaussian distribution:

         p(x \mid \theta) = (1 - w)\, N(x \mid \theta, I) + w\, N(x \mid 0, aI)
         p(\theta) = N(\theta \mid 0, bI)

       – Ad predictor
     Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
 23. Distribution Approximation
     Approximate p(x) with q(x), which belongs to the exponential family, such that:

       q(x) = h(x)\, g(\eta)\, \exp\{\eta^T u(x)\}

       KL(p \| q) = -\int p(x) \ln \frac{q(x)}{p(x)} dx
                  = -\int p(x) \ln q(x)\, dx + \int p(x) \ln p(x)\, dx
                  = -\int p(x) \ln g(\eta)\, dx - \int p(x)\, \eta^T u(x)\, dx + \text{const}
                  = -\ln g(\eta) - \eta^T E_{p(x)}[u(x)] + \text{const},

     where the const terms are independent of the natural parameter \eta.
     Minimize KL(p \| q) by setting the gradient with respect to \eta to zero:

       -\nabla \ln g(\eta) = E_{p(x)}[u(x)]

     By leveraging formula (2.226) in PRML:

       E_{q(x)}[u(x)] = -\nabla \ln g(\eta) = E_{p(x)}[u(x)]
 24. Moment matching
     It is called moment matching when q(x) is a Gaussian distribution; then u(x) = (x, x^2)^T, so:

       \int q(x)\, x\, dx = \int p(x)\, x\, dx, and \int q(x)\, x^2 dx = \int p(x)\, x^2 dx

       => mean_{q(x)} = mean_{p(x)},
          variance_{q(x)} = \int q(x)\, x^2 dx - mean_{q(x)}^2
                          = \int p(x)\, x^2 dx - mean_{p(x)}^2 = variance_{p(x)}

     • Moments of a distribution: the k-th moment is M_k = \int_a^b x^k f(x)\, dx (see the sketch below)
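A short sketch of moment matching in practice (my own example, not from the deck): project a two-component Gaussian mixture p(x) onto a single Gaussian q(x) by matching the first two moments, which is exactly the KL(p || q) minimizer within the Gaussian family:

```python
# Moment-matching a Gaussian mixture with one Gaussian (my own illustration).
import numpy as np

w = np.array([0.7, 0.3])          # mixture weights
mu = np.array([0.0, 4.0])         # component means
var = np.array([1.0, 0.5])        # component variances

# First and second moments of the mixture p(x)
m1 = np.sum(w * mu)               # E_p[x]
m2 = np.sum(w * (var + mu**2))    # E_p[x^2]

q_mean = m1
q_var = m2 - m1**2                # law of total variance
print(q_mean, q_var)              # q(x) = N(q_mean, q_var) matches p's moments
```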
 25. EXPECTATION PROPAGATION
     = Belief Propagation + Moment matching?
 26. Key Idea
     • Approximate each factor with a Gaussian distribution
     • Approximate corresponding factor pairs one by one?
     • Approximate each factor in turn in the context of all the remaining factors (proposed by Minka):

       refine factor \tilde{f}_j(\theta) by ensuring that q^{new}(\theta) \propto \tilde{f}_j(\theta)\, q^{\setminus j}(\theta)
       is close to f_j(\theta)\, q^{\setminus j}(\theta), in which q^{\setminus j}(\theta) = q(\theta) / \tilde{f}_j(\theta)
 27. EP: the detailed steps
     1. Initialize all of the approximating factors \tilde{f}_i(\theta).
     2. Initialize the posterior approximation by setting q(\theta) \propto \prod_i \tilde{f}_i(\theta).
     3. Until convergence:
        (a) Choose a factor \tilde{f}_j(\theta) to refine.
        (b) Remove \tilde{f}_j(\theta) from the posterior by division: q^{\setminus j}(\theta) = q(\theta) / \tilde{f}_j(\theta).
        (c) Get the new posterior q^{new}(\theta) by setting its sufficient statistics equal to those of
            f_j(\theta)\, q^{\setminus j}(\theta) / z_j (i.e., minimize KL(f_j(\theta)\, q^{\setminus j}(\theta) / z_j \| q^{new}(\theta))),
            in which z_j = \int f_j(\theta)\, q^{\setminus j}(\theta)\, d\theta.
        (d) Get the refined factor: \tilde{f}_j(\theta) = K\, q^{new}(\theta) / q^{\setminus j}(\theta).
 28. Example: The clutter problem
     • Infer the mean of a Gaussian distribution
     • Want to try MLE, but

       p(x \mid \theta) = (1 - w)\, N(x \mid \theta, I) + w\, N(x \mid 0, aI)
       p(\theta) = N(\theta \mid 0, bI)

     • Approximate with q(\theta) = N(\theta \mid m, vI), and each factor \tilde{f}_n(\theta) = N(\theta \mid m_n, v_n I)
       – Approximate the mixture of Gaussians using a Gaussian (see the sketch below)
     Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
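Below is a hedged one-dimensional implementation of EP on the clutter problem, following the update equations of PRML Section 10.7 (my own simplification, not the author's code; the data and hyperparameters are invented for the demo, and sites are kept in natural parameters so the division in step (b) becomes a subtraction):

```python
# 1-D EP for the clutter problem (my own sketch after PRML Ch. 10.7).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
w, a, b, theta_true = 0.25, 10.0, 100.0, 2.0
x = np.where(rng.random(50) < w,
             rng.normal(0.0, np.sqrt(a), 50),        # clutter component
             rng.normal(theta_true, 1.0, 50))        # signal component

N = len(x)
r = np.zeros(N)          # site precisions r_n = 1/v_n (0 = vacuous init)
h = np.zeros(N)          # site precision-means h_n = m_n/v_n
r_post, h_post = 1.0 / b, 0.0       # posterior starts at the prior N(0, b)

for _ in range(20):                 # sweep the sites until (hopefully) converged
    for n in range(N):
        # (b) cavity: divide the site out of the posterior
        r_cav, h_cav = r_post - r[n], h_post - h[n]
        if r_cav <= 0:              # skip sites that would make the cavity improper
            continue
        v_cav, m_cav = 1.0 / r_cav, h_cav / r_cav
        # (c) moment match q_new to f_n(theta) * cavity (PRML closed forms)
        z = ((1 - w) * norm.pdf(x[n], m_cav, np.sqrt(v_cav + 1))
             + w * norm.pdf(x[n], 0.0, np.sqrt(a)))
        rho = 1 - w * norm.pdf(x[n], 0.0, np.sqrt(a)) / z
        m_new = m_cav + rho * v_cav * (x[n] - m_cav) / (v_cav + 1)
        v_new = (v_cav - rho * v_cav**2 / (v_cav + 1)
                 + rho * (1 - rho) * v_cav**2 * (x[n] - m_cav)**2 / (v_cav + 1)**2)
        # (d) refined site = q_new / cavity, done in natural parameters
        r[n] = 1.0 / v_new - r_cav
        h[n] = m_new / v_new - h_cav
        r_post, h_post = 1.0 / v_new, m_new / v_new

print("posterior mean", h_post / r_post, "posterior var", 1.0 / r_post)
```

With the seed above the posterior mean lands near theta_true; note this sketch has no guard against negative site variances, which a robust EP implementation would need.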
 29. Example: The clutter problem (2)
     • Approximate a complex factor (e.g., a mixture of Gaussians) with a Gaussian:
       f_n(\theta) in blue, \tilde{f}_n(\theta) in red, and q^{\setminus n}(\theta) in green.
     • Remember that the variance of q^{\setminus n}(\theta) is usually very small, so \tilde{f}_n(\theta) only
       needs to approximate f_n(\theta) over a small range.
     Note: the two figures above are from the book 'Pattern Recognition and Machine Learning'.
 30. Application: Bayesian CTR predictor for Bing
     • See the details here
       – Inference, step by step (see the sketch below)
       – Making predictions
     • Some insights
       – The variance of each feature increases after every exposure
       – Samples with more features will have bigger variance
     • Independence assumption for the features
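As a sketch of the inference step (my own simplification of the update described in the Bing paper linked from the deck, assuming binary features and a probit link; not the production code): each observed label is absorbed by moment-matching a truncated Gaussian, which yields the familiar v/w correction functions:

```python
# adPredictor-style Bayesian update, binary features (my own simplification).
import numpy as np
from scipy.stats import norm

def v(t):  # mean correction from the truncated Gaussian
    return norm.pdf(t) / norm.cdf(t)

def w(t):  # variance correction
    return v(t) * (v(t) + t)

beta = 1.0                                  # probit noise scale
mu = np.zeros(5)                            # per-feature weight means
var = np.ones(5)                            # per-feature weight variances

def update(active, y):
    """One update for a sample with active feature ids and label y = +/-1."""
    s2 = beta**2 + var[active].sum()        # total score variance
    s = np.sqrt(s2)
    t = y * mu[active].sum() / s
    mu[active] += y * (var[active] / s) * v(t)
    var[active] *= 1.0 - (var[active] / s2) * w(t)

update([0, 2, 3], +1)
update([0, 4], -1)
print(mu, var)
```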
 31. Experimentation
     • The dataset is very inhomogeneous
     • Performance

       Model:  FTRL    OWLQN   Ad predictor
       AUC:    0.638   0.641   0.639

       – Other metrics
     • Pros: speed, low parameter-tuning cost, online learning support, interpretability,
       easy to add more factors
     • Cons: sparsity
     • Code
 32. Application: XBOX skill rating system
     • See details on pages 793~798 of Machine Learning: A Probabilistic Perspective (a sketch follows)
     Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
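For intuition, here is a hedged sketch of the two-player, no-draw TrueSkill update from the cited paper (my own simplification; team play, draw margins, and the dynamics term are all omitted):

```python
# Two-player, no-draw TrueSkill update (my own simplification of the paper).
import numpy as np
from scipy.stats import norm

def trueskill_update(mu_w, s2_w, mu_l, s2_l, beta=4.17):
    """Winner (mu_w, s2_w) beat loser (mu_l, s2_l); return updated (mean, var) pairs."""
    c2 = 2 * beta**2 + s2_w + s2_l          # total performance variance
    c = np.sqrt(c2)
    t = (mu_w - mu_l) / c
    v = norm.pdf(t) / norm.cdf(t)           # truncated-Gaussian mean correction
    w = v * (v + t)                         # variance correction
    return ((mu_w + s2_w / c * v, s2_w * (1 - s2_w / c2 * w)),
            (mu_l - s2_l / c * v, s2_l * (1 - s2_l / c2 * w)))

# Equal priors: the winner's skill goes up, the loser's goes down,
# and both players' uncertainties shrink.
print(trueskill_update(25.0, (25 / 3)**2, 25.0, (25 / 3)**2))
```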
 33. Apply to all Bayesian models
     • Infer.NET (Microsoft/Bishop)
       – A framework for running Bayesian inference in graphical models
       – Model-based machine learning
 34. References
     • Books
       – Chapters 2/8/10 of Pattern Recognition and Machine Learning
       – Chapter 22 of Machine Learning: A Probabilistic Perspective
     • Papers
       – A family of algorithms for approximate Bayesian inference
       – From belief propagation to expectation propagation
       – TrueSkill: A Bayesian Skill Rating System
       – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in
         Microsoft's Bing Search Engine
     • Roadmap for EP
