Expectation propagation


This is the deck for a Hulu internal machine learning workshop; it introduces the background, theory, and applications of the expectation propagation method.


1. Expectation Propagation: Theory and Application
   Dong Guo, Research Workshop 2013, Hulu Internal
   See more details at http://dongguo.me/blog/2014/01/01/expectation-propagation/ and http://dongguo.me/blog/2013/12/01/bayesian-ctr-prediction-for-bing/
2. Outline
   • Overview
   • Background
   • Theory
   • Applications
3. OVERVIEW
4. Bayesian Paradigm
   • Infer the posterior distribution (prior + data -> posterior -> make decision)
   Note: the figure of LDA is from Wikipedia, and the right figure is from the paper 'Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine'.
5. Bayesian inference methods
   • Exact inference
     – Belief propagation
   • Approximate inference
     – Stochastic (sampling)
     – Deterministic
       • Assumed density filtering
       • Expectation propagation
       • Variational Bayes
6. Message passing
   • A form of communication used in multiple domains of computer science
     – Parallel computing (MPI)
     – Object-oriented programming
     – Inter-process communication
     – Bayesian inference
   • A family of methods to infer posterior distributions
7. Expectation Propagation
   • Belongs to the message passing family
   • Approximate method (iteration is needed)
   • Very popular in Bayesian inference, especially in graphical models
8. Researchers
   • Thomas Minka
     – EP was proposed in his PhD thesis
   • Kevin P. Murphy
     – Machine Learning: A Probabilistic Perspective
9. BACKGROUND
10. Background
   • (Truncated) Gaussian
   • Exponential family
   • Graphical model
   • Factor graph
   • Belief propagation
   • Moment matching
11. Gaussian and Truncated Gaussian
   • Gaussian operations are the basis for EP inference (see the sketch below)
     – Gaussian +*/ Gaussian
     – Gaussian integral
   • The truncated Gaussian is used in many EP applications
   • See details here
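As an aside (not in the deck): the Gaussian multiply/divide operations referred to above reduce to adding or subtracting natural (precision) parameters, which is exactly how EP removes and re-inserts factors. A minimal plain-Python sketch; function names are illustrative:

```python
def to_natural(mean, var):
    """Convert (mean, variance) to natural parameters (precision-mean, precision)."""
    prec = 1.0 / var
    return mean * prec, prec

def from_natural(tau, prec):
    """Convert natural parameters back to (mean, variance)."""
    var = 1.0 / prec
    return tau * var, var

def gaussian_multiply(m1, v1, m2, v2):
    """Product of two Gaussian densities is an (unnormalized) Gaussian:
    natural parameters simply add."""
    t1, p1 = to_natural(m1, v1)
    t2, p2 = to_natural(m2, v2)
    return from_natural(t1 + t2, p1 + p2)

def gaussian_divide(m1, v1, m2, v2):
    """Ratio of two Gaussians: natural parameters subtract. This is the
    'remove a factor by division' step EP uses; note the result can be an
    improper Gaussian (negative precision)."""
    t1, p1 = to_natural(m1, v1)
    t2, p2 = to_natural(m2, v2)
    return from_natural(t1 - t2, p1 - p2)

# Example: combining a prior N(0, 1) with a Gaussian 'message' N(1, 4)
print(gaussian_multiply(0.0, 1.0, 1.0, 4.0))  # -> (0.2, 0.8)
```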
12. Exponential family distribution
   • Very good summary on Wikipedia
   $q(z) = h(z)\,g(\eta)\exp\{\eta^T u(z)\}$
   • Sufficient statistics of the Gaussian distribution: $(x, x^2)$
   • Typical distributions
   Note: the above 4 figures are from Wikipedia.
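A quick numerical illustration (mine, assuming NumPy) of the $(x, x^2)$ sufficient statistics: the first two moments are all a Gaussian fit needs, since mean and variance are recovered from $E[x]$ and $E[x^2]$.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=3.0, size=1_000_000)

E_x = samples.mean()          # first moment  -> mean
E_x2 = (samples ** 2).mean()  # second moment -> mean^2 + variance
print(E_x, E_x2 - E_x ** 2)   # ~2.0 and ~9.0
```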
13. Graphical Models
   • Directed graph (Bayesian Network) over x1, x2, x3, x4:
   $P(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \mid \mathrm{pa}_k)$
   • Undirected graph (Conditional Random Field) over x1, x2, x3, x4
14. Factor graph
   • Expresses the relations between variable nodes explicitly
   • Relation on an edge -> factor node
   • Hides the difference between BN and CRF during inference
   • Makes inference more intuitive
15. BELIEF PROPAGATION
16. Belief Propagation Overview
   • Exact Bayesian method to infer marginal distributions
     – 'sum-product' message passing
   • Key components
     – Calculate the posterior distribution of a variable node
     – Two kinds of messages
17. Posterior distribution of a variable node
   • Factor graph:
   $p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$, for any variable $x$ in the graph
   $p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x}) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)} \sum_{X_s} F_s(x, X_s) = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$
   in which $\mu_{f_s \to x}(x) = \sum_{X_s} F_s(x, X_s)$
   Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
18. Message: factor -> variable node
   • Factor graph:
   $\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$
   in which $\{x_1, \ldots, x_M\}$ is the set of variables on which the factor $f_s$ depends
   Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
19. Message: variable -> factor node
   • Factor graph:
   $\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$
   Summary: the posterior distribution is determined only by factors!
   Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
20. Whole steps of BP
   • Steps to calculate the posterior distribution of a given variable node (see the sketch below):
     – Step 1: construct the factor graph
     – Step 2: treat the variable node as the root, and initialize the messages sent from the leaf nodes
     – Step 3: apply the message passing steps recursively until the root node receives messages from all of its neighbors
     – Step 4: get the marginal distribution by multiplying all messages sent in
   Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
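Here is a self-contained sketch (not from the deck, assuming NumPy) of steps 1-4 on a tiny chain factor graph over binary variables, with a brute-force check of the resulting marginal; all factor values are illustrative:

```python
import numpy as np

# Chain factor graph: x1 -- f(x1,x2) -- x2 -- f(x2,x3) -- x3,
# with a unary factor g1 on x1. Root: x3.
g1 = np.array([0.7, 0.3])                 # unary factor on x1
f12 = np.array([[0.9, 0.1], [0.2, 0.8]])  # pairwise factor f(x1, x2)
f23 = np.array([[0.6, 0.4], [0.3, 0.7]])  # pairwise factor f(x2, x3)

# Steps 2-3: messages from the leaf toward the root
mu_x1_to_f12 = g1                    # leaf variable forwards its unary factor
mu_f12_to_x2 = f12.T @ mu_x1_to_f12  # sum over x1: sum_x1 f(x1,x2) mu(x1)
mu_x2_to_f23 = mu_f12_to_x2          # x2 has no other incoming messages
mu_f23_to_x3 = f23.T @ mu_x2_to_f23  # sum over x2

# Step 4: marginal of the root = product of incoming messages, normalized
p_x3 = mu_f23_to_x3 / mu_f23_to_x3.sum()

# Brute-force check over all 2^3 joint configurations
joint = g1[:, None, None] * f12[:, :, None] * f23[None, :, :]
print(p_x3, joint.sum(axis=(0, 1)) / joint.sum())  # identical marginals
```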
21. BP: example
   • Infer the marginal distribution of x_3
   • Infer the marginal distribution of every variable
   Note: the figures are from the book 'Pattern Recognition and Machine Learning'.
22. Posterior is sometimes intractable
   • Example
     – Infer the mean of a Gaussian distribution:
     $p(x \mid \theta) = (1 - w)\,N(x \mid \theta, I) + w\,N(x \mid 0, aI)$
     $p(\theta) = N(\theta \mid 0, bI)$
     – Ad predictor
   Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
23. Distribution Approximation
   Approximate $p(x)$ with $q(x)$, which belongs to the exponential family, such that $q(x) = h(x)\,g(\eta)\exp\{\eta^T u(x)\}$:
   $\mathrm{KL}(p \| q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx = -\int p(x)\ln q(x)\,dx + \int p(x)\ln p(x)\,dx$
   $= -\int p(x)\ln g(\eta)\,dx - \int p(x)\,\eta^T u(x)\,dx + \mathrm{const} = -\ln g(\eta) - \eta^T \mathbb{E}_{p(x)}[u(x)] + \mathrm{const}$
   where the const terms are independent of the natural parameter $\eta$.
   Minimize $\mathrm{KL}(p \| q)$ by setting the gradient with respect to $\eta$ to zero:
   $-\nabla \ln g(\eta) = \mathbb{E}_{p(x)}[u(x)]$
   By leveraging formula (2.226) in PRML:
   $\mathbb{E}_{q(x)}[u(x)] = -\nabla \ln g(\eta) = \mathbb{E}_{p(x)}[u(x)]$
24. Moment matching
   It is called moment matching when $q(x)$ is a Gaussian distribution; then $u(x) = (x, x^2)^T$
   $\Rightarrow \int q(x)\,x\,dx = \int p(x)\,x\,dx$, and $\int q(x)\,x^2\,dx = \int p(x)\,x^2\,dx$
   $\Rightarrow \mathrm{mean}_{q(x)} = \mathrm{mean}_{p(x)}$, and $\mathrm{variance}_{q(x)} = \mathrm{variance}_{p(x)}$
   • Moments of a distribution: the k-th moment is $M_k = \int_a^b x^k f(x)\,dx$
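A small sketch (mine, assuming NumPy) of moment matching in action: projecting a two-component Gaussian mixture $p(x)$ onto a single Gaussian $q(x)$ by matching the first two moments, i.e., minimizing $\mathrm{KL}(p \| q)$. The mixture parameters are illustrative:

```python
import numpy as np

w = np.array([0.7, 0.3])       # mixture weights
mu = np.array([-1.0, 2.0])     # component means
var = np.array([0.5, 1.5])     # component variances

# Exact moments of the mixture:
E_x = np.sum(w * mu)                  # E[x]
E_x2 = np.sum(w * (var + mu ** 2))    # E[x^2] = sum_k w_k (var_k + mu_k^2)

q_mean = E_x
q_var = E_x2 - E_x ** 2               # matched variance
print(q_mean, q_var)                  # the matched Gaussian q(x) = N(q_mean, q_var)
```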
25. EXPECTATION PROPAGATION = Belief Propagation + Moment matching?
26. Key Idea
   • Approximate each factor with a Gaussian distribution
   • Approximate corresponding factor pairs one by one?
   • Approximate each factor in turn in the context of all remaining factors (proposed by Minka):
   refine factor $\tilde{f}_j(\theta)$ by ensuring $q^{new}(\theta) \propto \tilde{f}_j(\theta)\,q^{\setminus j}(\theta)$ is close to $f_j(\theta)\,q^{\setminus j}(\theta)$, in which $q^{\setminus j}(\theta) = \frac{q(\theta)}{\tilde{f}_j(\theta)}$
27. EP: the detailed steps (a runnable sketch follows the clutter example below)
   1. Initialize all of the approximating factors $\tilde{f}_i(\theta)$.
   2. Initialize the posterior approximation by setting $q(\theta) \propto \prod_i \tilde{f}_i(\theta)$.
   3. Until convergence:
      (a) Choose a factor $\tilde{f}_j(\theta)$ to refine.
      (b) Remove $\tilde{f}_j(\theta)$ from the posterior by division: $q^{\setminus j}(\theta) = \frac{q(\theta)}{\tilde{f}_j(\theta)}$.
      (c) Get the new posterior by setting the sufficient statistics of $q^{new}(\theta)$ equal to those of $\frac{1}{z_j} f_j(\theta)\,q^{\setminus j}(\theta)$ (minimize $\mathrm{KL}\big(\tfrac{1}{z_j} f_j(\theta)\,q^{\setminus j}(\theta) \,\|\, q^{new}(\theta)\big)$), in which $z_j = \int f_j(\theta)\,q^{\setminus j}(\theta)\,d\theta$.
      (d) Get the refined factor: $\tilde{f}_j(\theta) = k\,\frac{q^{new}(\theta)}{q^{\setminus j}(\theta)}$.
28. Example: the clutter problem
   • Infer the mean of a Gaussian distribution
   • Want to try MLE, but:
   $p(x \mid \theta) = (1 - w)\,N(x \mid \theta, I) + w\,N(x \mid 0, aI)$
   $p(\theta) = N(\theta \mid 0, bI)$
   • Approximate with $q(\theta) = N(\theta \mid m, vI)$, and each factor $\tilde{f}_n(\theta) = N(\theta \mid m_n, v_n I)$
     – Approximate the mixture of Gaussians with a Gaussian
   Note: the figure is from the book 'Pattern Recognition and Machine Learning'.
29. Example: the clutter problem (2)
   • Approximate a complex factor (e.g., a mixture of Gaussians) with a Gaussian
   $f_n(\theta)$ in blue, $\tilde{f}_n(\theta)$ in red, and $q^{\setminus n}(\theta)$ in green
   Remember that the variance of $q^{\setminus n}(\theta)$ is usually very small, so $\tilde{f}_n(\theta)$ only needs to approximate $f_n(\theta)$ in a small range.
   Note: the above 2 figures are from the book 'Pattern Recognition and Machine Learning'.
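Below is a minimal sketch (my own, not from the deck, assuming NumPy/SciPy) of the EP loop from slide 27 applied to this 1-D clutter problem. The site updates use the standard clutter-problem moments from Minka's thesis; the factor normalizers $s_n$ are dropped, as on the slide, since only the posterior over $\theta$ is wanted:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
theta_true, w, a, b = 2.0, 0.2, 10.0, 100.0
x = np.where(rng.random(50) < w,
             rng.normal(0.0, np.sqrt(a), 50),          # clutter points
             rng.normal(theta_true, 1.0, 50))          # signal points

N = len(x)
tau, prec = np.zeros(N), np.zeros(N)  # natural params of factors f~_n (init to 1)
m, v = 0.0, b                         # posterior q starts at the prior N(0, b)

for _ in range(20):                   # EP sweeps until (approximate) convergence
    for n in range(N):
        # (b) cavity q^\n = q / f~_n via natural-parameter subtraction
        prec_cav = 1.0 / v - prec[n]
        if prec_cav <= 0:             # skip updates that give an improper cavity
            continue
        v_cav = 1.0 / prec_cav
        m_cav = (m / v - tau[n]) * v_cav
        # (c) moments of f_n(theta) q^\n(theta): responsibility of the signal part
        Z_sig = (1 - w) * norm.pdf(x[n], m_cav, np.sqrt(v_cav + 1.0))
        Z_clu = w * norm.pdf(x[n], 0.0, np.sqrt(a))
        r = Z_sig / (Z_sig + Z_clu)
        d = x[n] - m_cav
        m_new = m_cav + r * v_cav * d / (v_cav + 1.0)
        v_new = (v_cav - r * v_cav**2 / (v_cav + 1.0)
                 + r * (1 - r) * v_cav**2 * d**2 / (v_cav + 1.0)**2)
        # (d) refined factor f~_n = q_new / q^\n, again in natural parameters
        prec[n] = 1.0 / v_new - prec_cav
        tau[n] = m_new / v_new - m_cav * prec_cav
        m, v = m_new, v_new

print(m, v)   # the posterior mean should land near theta_true
```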
30. Application: Bayesian CTR predictor for Bing
   • See the details here
     – Inference step by step (a sketch of the update follows below)
     – Make predictions
   • Some insights
     – The variance of each feature increases after every exposure
     – A sample with more features will have bigger variance
   • Independence assumption for the features
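The closed-form update below is a hedged sketch of the adPredictor-style Bayesian probit update described in the cited Bing paper: each weight keeps an independent Gaussian belief $N(\mu_i, \sigma_i^2)$, and one impression updates it via the truncated-Gaussian moment functions $v(t)$ and $w(t)$. Variable and function names are mine, not the paper's; consult the paper for the authoritative form.

```python
import numpy as np
from scipy.stats import norm

def v_fn(t):   # truncated-Gaussian mean correction: N(t) / Phi(t)
    return norm.pdf(t) / norm.cdf(t)

def w_fn(t):   # truncated-Gaussian variance correction
    vt = v_fn(t)
    return vt * (vt + t)

def update(mu, sigma2, x, y, beta=1.0):
    """One online update. x: binary feature vector, y: +1 (click) / -1 (no click)."""
    total = beta**2 + np.sum(x * sigma2)          # total predictive variance
    t = y * np.sum(x * mu) / np.sqrt(total)
    mu = mu + y * x * sigma2 * v_fn(t) / np.sqrt(total)
    sigma2 = sigma2 * (1.0 - x * sigma2 * w_fn(t) / total)
    return mu, sigma2

mu, sigma2 = np.zeros(3), np.ones(3)              # prior beliefs for 3 features
mu, sigma2 = update(mu, sigma2, np.array([1.0, 0.0, 1.0]), y=+1)
print(mu, sigma2)
```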
31. Experimentation
   • The dataset is very inhomogeneous
   • Performance:
     Model | FTRL  | OWLQN | Ad predictor
     AUC   | 0.638 | 0.641 | 0.639
     – Other metrics
       • Pros: speed, parameter choice cost, online learning support, interpretable, supports adding more factors
       • Cons: sparse
   • Code
32. Application: XBOX skill rating system
   • See details on pages 793-798 of Machine Learning: A Probabilistic Perspective
   Note: the figure is from the paper 'TrueSkill: A Bayesian Skill Rating System'.
33. Apply to all Bayesian models
   • Infer.NET (Microsoft/Bishop)
     – A framework for running Bayesian inference in graphical models
     – Model-based machine learning
34. References
   • Books
     – Chapters 2/8/10 of Pattern Recognition and Machine Learning
     – Chapter 22 of Machine Learning: A Probabilistic Perspective
   • Papers
     – A Family of Algorithms for Approximate Bayesian Inference
     – From Belief Propagation to Expectation Propagation
     – TrueSkill: A Bayesian Skill Rating System
     – Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine
   • Roadmap for EP
