# Olivier Cappé's talk at BigMC March 2011

Olivier Cappé's talk at BigMC seminar on 3rd March 2011, regarding his new online EM algorithm

### Olivier Cappé's talk at BigMC March 2011

1. Online EM Algorithm and Some Extensions. Olivier Cappé, Télécom ParisTech & CNRS, March 2011.
    (Footer on every slide: O. Cappé (@ BigMC), Online EM Algorithm, March 2011; 34 slides.)
2. Online Estimation for Missing Data Models
    Based on (Cappé & Moulines, 2009) and (Cappé, 2010).
    Goals:
    - (1) Maximum likelihood estimation, or
    - (1') Competitive with maximum likelihood estimation when the number of observations is large
    - (2) Good scaling (performance vs. computational cost) as the number of observations increases
    - ((3)) Process data on the fly (no storage)
    - (4) Simple to implement (no line search, projection, preconditioning, etc.)
3. Outline
    1. The EM Algorithm in Exponential Families
    2. The Limiting EM Recursion
    3. Online EM Algorithm (The Algorithm; Properties and Discussion)
    4. Use for Batch ML Estimation
    5. Extensions
    6. References
4. The EM Algorithm in Exponential Families: Missing Data Model
    A missing data model is a statistical model $\{p_\theta(x, y)\}_{\theta \in \Theta}$ in which only $Y$ may be observed (the couple $(X, Y)$ is referred to as the complete data).
    - Hence, parameter estimates $\theta_n$ must be a function of the observations $Y_1, \dots, Y_n$ only (here assumed to be independent and identically distributed).
    - Of course, the statistical model could also be defined as $\{f_\theta(y)\}_{\theta \in \Theta}$, where $f_\theta(y) = \int p_\theta(x, y)\,dx$, but the specific structure of $f_\theta$ needs to be exploited.

    To analyze the methods, the data $\{Y_t\}_{t \ge 1}$ is assumed to be generated by an i.i.d. process with marginal $\pi$, not necessarily equal to $f_\theta$.
5. The EM Algorithm in Exponential Families: Finite Mixture Model
    Mixture PDF:
    $$f(y) = \sum_{i=1}^{m} \alpha_i f_i(y)$$
    Missing data interpretation:
    $$P(X_t = i) = \alpha_i, \qquad Y_t \mid X_t = i \sim f_i(y)$$
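The missing data interpretation of the mixture can be read as a two-stage sampling recipe: draw the hidden label, then draw the observation from the selected component. The sketch below illustrates that reading; the function name `sample_mixture`, the Gaussian components and all numeric values are assumptions made for the example, not part of the talk.

```python
import random

def sample_mixture(alphas, component_samplers, n, seed=0):
    """Draw n complete-data pairs (x, y): x is the hidden label, y the observation."""
    rng = random.Random(seed)
    labels = range(len(alphas))
    pairs = []
    for _ in range(n):
        # P(X_t = i) = alpha_i
        x = rng.choices(labels, weights=alphas)[0]
        # Y_t | X_t = i  ~  f_i
        y = component_samplers[x](rng)
        pairs.append((x, y))
    return pairs

# Two Gaussian components, chosen only for illustration.
samplers = [lambda rng: rng.gauss(-2.0, 1.0), lambda rng: rng.gauss(3.0, 0.5)]
pairs = sample_mixture([0.3, 0.7], samplers, n=5000)

# In a missing data model only the second coordinate is available to the estimator.
observed = [y for _, y in pairs]
```

Discarding the labels, as in the last line, is what turns the complete data $(X_t, Y_t)$ into the observed data $Y_t$.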
6. The EM Algorithm in Exponential Families
    To determine the maximum likelihood estimate
    $$\theta_n = \arg\max_{\theta} \sum_{t=1}^{n} \log f_\theta(Y_t)$$
    numerically, the standard approach is the following.
    Expectation-Maximization (Dempster, Laird & Rubin, 1977): given a current parameter guess $\theta_n^k$,
    - E-Step: compute
      $$q_{n,\theta_n^k}(\theta) = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}\left[ \log p_\theta(X_t, Y_t) \mid Y_t \right]$$
    - M-Step: update the parameter estimate to
      $$\theta_n^{k+1} = \arg\max_{\theta \in \Theta} q_{n,\theta_n^k}(\theta)$$
7. The EM Algorithm in Exponential Families: Rationale
    1. It is an ascent algorithm (shown using Jensen's inequality).
       [Figure: The EM intermediate quantity is a minorizing surrogate.]
    2. Because of the Fisher relation, the algorithm can only stop at a stationary point of the log-likelihood*.

    *See (Wu, 1983) for the necessary topological and regularity assumptions.
8. The EM Algorithm in Exponential Families: An Example, the Poisson Mixture
    Likelihood:
    $$f_\theta(Y) = \sum_{j=1}^{m} \alpha_j \frac{\lambda_j^{Y}}{Y!} e^{-\lambda_j}$$
    "Complete-data" log-likelihood:
    $$\log p_\theta(X, Y) = -\log(Y!) + \sum_{j=1}^{m} \left[ \log(\alpha_j) - \lambda_j \right] 1\{X = j\} + \sum_{j=1}^{m} \log(\lambda_j)\, Y\, 1\{X = j\}$$
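9. The EM Algorithm in Exponential Families: EM Algorithm for the Poisson Mixture
    EM E-Step (the term $-\log(Y_t!)$, which does not depend on $\theta$, is omitted):
    $$q_{n,\theta_n^k}(\theta) = \sum_{j=1}^{m} \left[ \log(\alpha_j) - \lambda_j \right] \frac{1}{n} \sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t) + \sum_{j=1}^{m} \log(\lambda_j)\, \frac{1}{n} \sum_{t=1}^{n} Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)$$
    EM M-Step:
    $$\alpha_{n,j}^{k+1} = \frac{1}{n} \sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t), \qquad \lambda_{n,j}^{k+1} = \frac{\sum_{t=1}^{n} Y_t\, P_{\theta_n^k}(X_t = j \mid Y_t)}{\sum_{t=1}^{n} P_{\theta_n^k}(X_t = j \mid Y_t)}$$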
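The E-step and M-step for the Poisson mixture translate almost line by line into code. The following is a minimal sketch in pure Python, assuming integer count data and strictly positive starting values; the function names (`responsibilities`, `em_step`, `log_likelihood`) are illustrative choices, not names used in the talk.

```python
import math

def poisson_log_pmf(y, lam):
    """log of lam^y e^{-lam} / y!"""
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def responsibilities(y, alphas, lams):
    """Posterior probabilities P_theta(X = j | Y = y), computed in the log domain."""
    logs = [math.log(a) + poisson_log_pmf(y, l) for a, l in zip(alphas, lams)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def em_step(data, alphas, lams):
    """One batch EM iteration: a full pass over the data, then the closed-form M-step."""
    m, n = len(alphas), len(data)
    sum_p = [0.0] * m    # sum_t P(X_t = j | Y_t)
    sum_py = [0.0] * m   # sum_t Y_t P(X_t = j | Y_t)
    for y in data:
        p = responsibilities(y, alphas, lams)
        for j in range(m):
            sum_p[j] += p[j]
            sum_py[j] += p[j] * y
    new_alphas = [s / n for s in sum_p]
    new_lams = [sy / s for sy, s in zip(sum_py, sum_p)]
    return new_alphas, new_lams

def log_likelihood(data, alphas, lams):
    """Observed-data log-likelihood sum_t log f_theta(Y_t)."""
    total = 0.0
    for y in data:
        logs = [math.log(a) + poisson_log_pmf(y, l) for a, l in zip(alphas, lams)]
        top = max(logs)
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total
```

Calling `em_step` repeatedly and tracking `log_likelihood` is a simple way to check the ascent property in practice. Note that each iteration touches all $n$ observations, which is the cost the online variant avoids.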
10. The EM Algorithm in Exponential Families: Exponential Family Model
    In the following, we assume that the complete-data model belongs to an exponential family.
    (Curved) exponential family model:
    $$p_\theta(x, y) = \exp\left( \langle s(x, y), \psi(\theta) \rangle - A(\theta) \right)$$
    where $s(x, y)$ is the vector of (complete-data) sufficient statistics.
    Explicit complete-data maximum likelihood:
    $$S \mapsto \bar\theta(S) = \arg\max_{\theta} \; \langle S, \psi(\theta) \rangle - A(\theta)$$
    is available in closed form.
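As a worked instance of these definitions, the Poisson mixture can be put in this form by reading the terms off its complete-data log-likelihood $\log p_\theta(x, y) = -\log(y!) + \sum_j [\log \alpha_j - \lambda_j] 1\{x = j\} + \sum_j \log(\lambda_j)\, y\, 1\{x = j\}$. The superscript notation $S^{\alpha}_j$, $S^{\lambda}_j$ for the two blocks of the statistic is a labelling assumed here for readability, and the factor $1/y!$, which does not depend on $\theta$, is treated as part of the reference measure.

$$s(x, y) = \left( 1\{x = j\},\; y\, 1\{x = j\} \right)_{j=1,\dots,m}, \qquad \psi(\theta) = \left( \log \alpha_j - \lambda_j,\; \log \lambda_j \right)_{j=1,\dots,m}, \qquad A(\theta) = 0$$

For $S = (S^{\alpha}_j, S^{\lambda}_j)_{j}$ with $\sum_j S^{\alpha}_j = 1$, maximizing $\sum_j S^{\alpha}_j (\log \alpha_j - \lambda_j) + \sum_j S^{\lambda}_j \log \lambda_j$ under the constraint $\sum_j \alpha_j = 1$ gives

$$\bar\theta(S): \qquad \alpha_j = S^{\alpha}_j, \qquad \lambda_j = \frac{S^{\lambda}_j}{S^{\alpha}_j}$$

The $\lambda_j$ part follows from setting the derivative $-S^{\alpha}_j + S^{\lambda}_j / \lambda_j$ to zero, and the $\alpha_j$ part from the usual Lagrange multiplier argument for a weighted sum of logarithms on the simplex.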
11. The EM Algorithm in Exponential Families: The EM Algorithm Revisited
    The k-th EM iteration (from n observations):
    - E-Step:
      $$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}\left[ s(X_t, Y_t) \mid Y_t \right]$$
    - M-Step:
      $$\theta_n^{k+1} = \bar\theta\left( S_n^{k+1} \right)$$
12. The Limiting EM Recursion: A Key Remark
    The k-th EM iteration (from n observations):
    - E-Step:
      $$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\theta_n^k}\left[ s(X_t, Y_t) \mid Y_t \right]$$
    - M-Step:
      $$\theta_n^{k+1} = \bar\theta\left( S_n^{k+1} \right)$$

    Can be fully reparameterized in the domain of sufficient statistics:
    $$S_n^{k+1} = \frac{1}{n} \sum_{t=1}^{n} E_{\bar\theta(S_n^k)}\left[ s(X_t, Y_t) \mid Y_t \right]$$
13. The Limiting EM Recursion
    By letting n tend to infinity, one obtains two equivalent updates.
    Sufficient statistics update:
    $$S^{k} = E_\pi\left( E_{\bar\theta(S^{k-1})}\left[ s(X_1, Y_1) \mid Y_1 \right] \right)$$
    Parameter update:
    $$\theta^{k} = \bar\theta\left\{ E_\pi\left( E_{\theta^{k-1}}\left[ s(X_1, Y_1) \mid Y_1 \right] \right) \right\}$$
    Using the usual EM arguments, these updates are such that:
    1. The Kullback-Leibler divergence $D(\pi \mid f_{\theta^k})$ is monotonically decreasing with $k$.
    2. They converge to $\{\theta : \nabla_\theta D(\pi \mid f_\theta) = 0\}$.
14. The Limiting EM Recursion: Batch EM Is Not Efficient for Large Data Records
    See also (Neal & Hinton, 1999).
    [Figure: box-and-whisker plots of $\|u\|^2$ (vertical axis, 0 to 4) against batch EM iterations 1 to 20; top panel $2 \times 10^3$ observations, bottom panel $20 \times 10^3$ observations.]
    Figure: Convergence of batch EM estimates of $\|u\|^2$ as a function of the number of EM iterations for 2,000 (top) and 20,000 (bottom) observations. The box-and-whisker plots are computed from 1,000 independent replications of the simulated data. The grey region corresponds to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE (from [Cappé, 2010]).
15. Online EM Algorithm (The Algorithm): The Online EM Algorithm
    - The online EM algorithm outputs one updated parameter estimate $\theta_n$ after processing each individual observation $Y_n$.
    - The parameter update is very similar to applying the EM algorithm to the single observation $Y_n$ (with smoothing).
    - The memory footprint of the algorithm is constant, while its computational cost is proportional to the number of processed observations.
16. Online EM Algorithm (The Algorithm): Online EM, Rationale
    We try to locate the solutions of
    $$E_\pi\left( E_{\bar\theta(S)}\left[ s(X_1, Y_1) \mid Y_1 \right] \right) - S = 0$$
    Viewing $E_{\bar\theta(S)}[ s(X_n, Y_n) \mid Y_n ]$ as a noisy observation of $E_\pi( E_{\bar\theta(S)}[ s(X_1, Y_1) \mid Y_1 ] )$, this is exactly the usual Stochastic Approximation (or Robbins-Monro) setup:
    $$S_n = S_{n-1} + \gamma_n \left( E_{\bar\theta(S_{n-1})}\left[ s(X_n, Y_n) \mid Y_n \right] - S_{n-1} \right)$$
    where $(\gamma_n)$ is a sequence of decreasing positive stepsizes.
17. Online EM Algorithm (The Algorithm): The Algorithm
    Online EM algorithm:
    - Stochastic E-Step:
      $$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}\left[ s(X_n, Y_n) \mid Y_n \right]$$
    - M-Step:
      $$\theta_n = \bar\theta(S_n)$$

    Practical recommendations:
    - Use $\gamma_n = 1/n^{\alpha}$ with $\alpha \in [0.6, 0.7]$.
    - Don't do the M-step for the first 10–20 observations.
    - (Optional) Use Polyak-Ruppert averaging (requires choosing $n_0$).
18. Online EM Algorithm (The Algorithm): Online EM in the Poisson Mixture Example
    SA E-Step, computing conditional expectations:
    $$p_{n,j} = \frac{\hat\alpha_{n-1,j}\, \hat\lambda_{n-1,j}^{Y_n}\, e^{-\hat\lambda_{n-1,j}}}{\sum_{i=1}^{m} \hat\alpha_{n-1,i}\, \hat\lambda_{n-1,i}^{Y_n}\, e^{-\hat\lambda_{n-1,i}}}$$
    Statistics update (stochastic approximation):
    $$S^{\alpha}_{n,j} = (1 - \gamma_n) S^{\alpha}_{n-1,j} + \gamma_n\, p_{n,j}$$
    $$S^{\lambda}_{n,j} = (1 - \gamma_n) S^{\lambda}_{n-1,j} + \gamma_n\, p_{n,j}\, Y_n$$
    M-Step, parameter update:
    $$\hat\alpha_{n,j} = S^{\alpha}_{n,j}, \qquad \hat\lambda_{n,j} = S^{\lambda}_{n,j} / S^{\alpha}_{n,j}$$
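The three steps for the Poisson mixture (conditional expectations, statistics update, parameter update) can be sketched as a single pass over a data stream. This is a minimal pure-Python illustration, assuming the stepsize $\gamma_n = n^{-\alpha}$ and a burn-in period without M-step as in the practical recommendations; the function names, the initialisation of the statistics from the starting parameters, the small floor on $\lambda$ and the simulated data are all assumptions made for the example.

```python
import math
import random

def responsibilities(y, alphas, lams):
    """p_j = P(X = j | Y = y) under the current parameters (the y! factor cancels)."""
    logs = [math.log(a) + y * math.log(l) - l for a, l in zip(alphas, lams)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def online_em(stream, alphas, lams, exponent=0.6, burn_in=20):
    """One pass of online EM for a Poisson mixture; memory use does not grow with n."""
    alphas, lams = list(alphas), list(lams)
    # Assumed initialisation: statistics consistent with the starting parameters.
    s_alpha = list(alphas)
    s_lam = [a * l for a, l in zip(alphas, lams)]
    for n, y in enumerate(stream, start=1):
        gamma = n ** (-exponent)
        p = responsibilities(y, alphas, lams)
        # Stochastic approximation E-step on the sufficient statistics.
        for j, pj in enumerate(p):
            s_alpha[j] = (1 - gamma) * s_alpha[j] + gamma * pj
            s_lam[j] = (1 - gamma) * s_lam[j] + gamma * pj * y
        # M-step, skipped during the first few observations.
        if n > burn_in:
            alphas = list(s_alpha)
            # The 1e-12 floor is a numerical safeguard added for this sketch.
            lams = [max(sl / sa, 1e-12) for sl, sa in zip(s_lam, s_alpha)]
    return alphas, lams

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

# Simulated stream: weights (0.3, 0.7), intensities (2, 12), for illustration only.
rng = random.Random(0)
stream = [sample_poisson(rng, 2.0 if rng.random() < 0.3 else 12.0)
          for _ in range(20000)]
alphas, lams = online_em(stream, [0.5, 0.5], [1.0, 8.0])
```

Because each update is a convex combination of the previous statistic and a probability vector, the weights in `s_alpha` keep summing to one without any explicit projection.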
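19. Online EM Algorithm (Properties and Discussion): Analysis
    (Cappé & Moulines, 2009)
    Under $\sum_n \gamma_n = \infty$, $\sum_n \gamma_n^2 < \infty$, compactness of $\Theta$ and other regularity assumptions:
    1. The estimate $\theta_n$ converges to one of the roots of $\nabla_\theta D(\pi \mid f_\theta) = 0$.
    2. The algorithm is asymptotically equivalent to
       $$\theta_n = \theta_{n-1} + \gamma_n J^{-1}(\theta_{n-1}) \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
       where $J(\theta) = -E_\pi\left( E_\theta\left[ \nabla_\theta^2 \log p_\theta(X_1, Y_1) \mid Y_1 \right] \right)$.
    3. For a well-specified model ($\pi = f_{\theta_\star}$) and under Polyak-Ruppert averaging†, $\tilde\theta_n$ is Fisher efficient:
       $$\sqrt{n}\, (\tilde\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} N\left( 0, I_f^{-1}(\theta_\star) \right)$$
       where $I_f(\theta_\star) = -E_{\theta_\star}\left[ \nabla_\theta^2 \log f_{\theta_\star}(Y_1) \right]$.

    † $\tilde\theta_n = \frac{1}{n - n_0} \sum_{t = n_0 + 1}^{n} \theta_t$, with $\gamma_n = n^{-\alpha}$ and $\alpha \in (1/2, 1)$.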
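The Polyak-Ruppert average in the footnote is simply the mean of the iterates produced after observation $n_0$, and it can be maintained recursively without storing them. A minimal sketch for scalar estimates follows; the function name `polyak_ruppert` is an assumption, and for vector parameters the same update would be applied coordinate by coordinate.

```python
def polyak_ruppert(estimates, n0):
    """Running averages (1 / (n - n0)) * sum_{t = n0 + 1}^{n} theta_t for each n > n0."""
    averaged = []
    running = 0.0
    for n, theta in enumerate(estimates, start=1):
        if n > n0:
            # Recursive form of the mean: no need to keep past iterates.
            running += (theta - running) / (n - n0)
            averaged.append(running)
    return averaged

# Averaging starts after the second iterate.
print(polyak_ruppert([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], n0=2))  # [3.0, 3.5, 4.0, 4.5]
```

Setting `n0` to half the number of observations corresponds to averaging over the second half of the run only.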
20. Online EM Algorithm (Properties and Discussion): Some More Details
    1. (Andrieu et al., 2005) but also (Delyon, 1994), (Benaïm, 1999), using the fact that $D(\pi \mid f_{\bar\theta(S)})$ is a Lyapunov function:
       $$\left\langle \nabla_S D(\pi \mid f_{\bar\theta(S)}),\; \underbrace{E_\pi\left( E_{\bar\theta(S)}\left[ s(X_1, Y_1) \mid Y_1 \right] \right) - S}_{\text{mean field}} \right\rangle \le 0$$
    2. Taylor series expansion of $\bar\theta$ to establish the equivalence (with remainder a.s. $o(\gamma_n)$).
    3. (Pelletier, 1998) to show that
       $$\gamma_n^{-1/2} (\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} N\left( 0, I_p^{-1}(\theta_\star)/2 \right)$$
       in well-specified models (where $I_p$ is the complete-data Fisher information matrix). General results of (Polyak and Juditsky, 1992) and (Mokkadem and Pelletier, 2006) on averaging.
21. Online EM Algorithm (Properties and Discussion): Illustration of Polyak-Ruppert Averaging
    [Figure: three panels titled α = 0.9, α = 0.6 and α = 0.6 with halfway averaging; vertical axis u (from −2 to 2), horizontal axis number of observations (0 to 2,000).]
    Figure: Four superimposed trajectories of the estimate of $u_1$ (first component of $u$) for various algorithm settings (α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The actual value of $u_1$ is equal to zero.
22. Online EM Algorithm (Properties and Discussion): Performance of Online EM
    [Figure: three panels titled α = 0.9, α = 0.6 and α = 0.6 with halfway averaging; vertical axis $\|u\|^2$ (0 to 4), horizontal axis number of observations ($0.2 \times 10^3$, $2 \times 10^3$, $20 \times 10^3$).]
    Figure: Online EM estimates of $\|u\|^2$ for various data sizes (200, 2,000 and 20,000 observations, from left to right) and algorithm settings (α = 0.9, α = 0.6 and α = 0.6 with Polyak-Ruppert averaging, from top to bottom). The box-and-whisker plots (outlier plotting suppressed) are computed from 1,000 independent replications of the simulated data. The grey regions correspond to ±2 interquartile range (approx. 99.3% coverage) under the asymptotic Gaussian approximation of the MLE.
23. Online EM Algorithm (Properties and Discussion): Related Works
    - (Titterington, 1984) proposes a gradient algorithm
      $$\theta_n = \theta_{n-1} + \gamma_n I_p^{-1}(\theta_{n-1}) \nabla_\theta \log f_{\theta_{n-1}}(Y_n)$$
      It is asymptotically equivalent to the algorithm (previously described) for well-specified models ($\pi = f_{\theta_\star}$).
    - (Neal & Hinton, 1999) describe an algorithm called Incremental EM that is equivalent (up to the first batch scan only) to Online EM used with $\gamma_n = 1/n$.
    - (Sato, 2000; Sato & Ishii, 2000) describe the algorithm and provide some analysis in the flat model case and for mixtures of Gaussians.
24. Online EM Algorithm (Properties and Discussion): How Does This Work in Practice?
    Fine. But don't use $\gamma_n = 1/n$‡.
    - Simulations in (Cappé & Moulines, 2009) on mixtures of Gaussian regressions.
    - Large-scale experiments on real data in (Liang & Klein, 2009), where the use of mini-batch blocking was found useful: apply the proposed algorithm considering $Y_{mk+1}, Y_{mk+2}, \dots, Y_{m(k+1)}$ as one observation.
    - Mini-batch blocking is useful in dealing with mixture-like models with infrequent components.

    ‡ $\gamma_n = \gamma_0 / (n_0 + n)$ can be an option but requires carefully setting $\gamma_0$ and $n_0$.
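Mini-batch blocking, as described here, amounts to running the online recursion on blocks of $m$ consecutive observations instead of single ones. The sketch below shows one way to do this for the Poisson mixture example, assuming that the block statistic is the average of the per-observation conditional expectations (so that the closed-form M-step is unchanged); that normalisation, the function names, the floor on $\lambda$ and the dropping of a trailing incomplete block are choices made for this illustration, not details taken from the talk or from (Liang & Klein, 2009).

```python
import math

def responsibilities(y, alphas, lams):
    """p_j = P(X = j | Y = y) under the current parameters (the y! factor cancels)."""
    logs = [math.log(a) + y * math.log(l) - l for a, l in zip(alphas, lams)]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]
    total = sum(weights)
    return [w / total for w in weights]

def minibatch_online_em(data, alphas, lams, block=10, exponent=0.6):
    """Online EM where each block of `block` observations counts as one observation."""
    alphas, lams = list(alphas), list(lams)
    m = len(alphas)
    s_alpha = list(alphas)
    s_lam = [a * l for a, l in zip(alphas, lams)]
    # A trailing block shorter than `block` is ignored in this sketch.
    for k in range(len(data) // block):
        chunk = data[k * block:(k + 1) * block]
        mean_p, mean_py = [0.0] * m, [0.0] * m
        # Parameters are held fixed while the block statistic is accumulated.
        for y in chunk:
            p = responsibilities(y, alphas, lams)
            for j in range(m):
                mean_p[j] += p[j] / block
                mean_py[j] += p[j] * y / block
        gamma = (k + 1) ** (-exponent)   # the stepsize is indexed by block, not by observation
        for j in range(m):
            s_alpha[j] = (1 - gamma) * s_alpha[j] + gamma * mean_p[j]
            s_lam[j] = (1 - gamma) * s_lam[j] + gamma * mean_py[j]
        alphas = list(s_alpha)
        lams = [max(sl / sa, 1e-12) for sl, sa in zip(s_lam, s_alpha)]
    return alphas, lams
```

Averaging over a block reduces the noise of each update, which gives rarely visited components a better chance of receiving a meaningful contribution before the parameters change.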
25. Online EM Algorithm (Properties and Discussion): Some Intuition About the Weights
    If $r_k = (1 - \gamma_k) r_{k-1} + \gamma_k E_k$ for $k \ge 1$, then:
    1. $r_n = \sum_{k=1}^{n} \omega_k^n E_k + \omega_0^n r_0$, with $\sum_{k=0}^{n} \omega_k^n = 1$.
    2. $\omega_k^n = \frac{1}{n + a}$ (for $k \ge 1$) when $\gamma_k = 1/(k + a)$, and is strictly increasing otherwise.
    3. $\sum_{k=1}^{n} (\omega_k^n)^2 \approx \frac{1}{2} n^{-\alpha}$ when $\gamma_k = k^{-\alpha}$, with $1/2 < \alpha < 1$.

    [Figure: three panels titled α = 1, α = 0.9 and α = 0.6 showing the weights; horizontal axis from 0 to 10,000, vertical scales of order $10^{-4}$, $10^{-4}$ and $10^{-3}$ respectively.]
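Unrolling the recursion $r_k = (1 - \gamma_k) r_{k-1} + \gamma_k E_k$ gives the weights explicitly as $\omega_k^n = \gamma_k \prod_{i=k+1}^{n} (1 - \gamma_i)$ for $k \ge 1$ and $\omega_0^n = \prod_{i=1}^{n} (1 - \gamma_i)$. The short sketch below computes them so that the statements about the weights can be checked numerically; the function name `weights` and the values of $n$ and $a$ are assumptions for the example.

```python
def weights(gammas):
    """Given gammas[k-1] = gamma_k for k = 1..n, return [w_0, w_1, ..., w_n]."""
    n = len(gammas)
    w = [0.0] * (n + 1)
    tail = 1.0   # running product of (1 - gamma_i) over i > k
    for k in range(n, 0, -1):
        w[k] = gammas[k - 1] * tail
        tail *= 1.0 - gammas[k - 1]
    w[0] = tail  # weight left on the initial value r_0
    return w

# gamma_k = 1 / (k + a): all observations receive the same weight 1 / (n + a).
n, a = 1000, 5
uniform = weights([1.0 / (k + a) for k in range(1, n + 1)])

# gamma_k = k^(-0.6): recent observations receive more weight.
n2, alpha = 10000, 0.6
decaying = weights([k ** (-alpha) for k in range(1, n2 + 1)])
ratio = sum(v * v for v in decaying[1:]) / (0.5 * n2 ** (-alpha))
```

With these settings `ratio` should be close to one, which matches the sum-of-squares approximation for $\gamma_k = k^{-\alpha}$.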
26. Use for Batch ML Estimation: How to Use Online EM for Batch ML Estimation?
    The most popular use of the method is to perform batch ML estimation from very large datasets.
    Because we did not assume that $\pi = f_\theta$, the previous analysis can be applied with $\pi$ equal to the empirical measure associated with $Y_1, \dots, Y_n$.
    - Online EM can be used for batch ML estimation by (randomly) scanning the data $Y_1, \dots, Y_n$.
    - Convergence "speed" (with averaging) is $(n_{\text{obs.}} \times n_{\text{scans}})^{-1/2}$ versus $\rho^{\,n_{\text{scans}}}$ for batch EM.
    - This is not a fair comparison in terms of computing time, as the M-step is not free and possible parallelization is ignored.
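Randomly scanning a fixed dataset can be expressed as a generator that turns the batch into a stream, which can then be fed to any online EM update. The sketch below uses a fresh random permutation for each scan; this is one possible reading of "(randomly) scanning" (sampling with replacement from the empirical measure would be another), and the name `random_scans` is an assumption for the example.

```python
import random

def random_scans(data, n_scans, seed=0):
    """Yield the observations of `data` n_scans times, reshuffled before each scan."""
    rng = random.Random(seed)
    order = list(range(len(data)))
    for _ in range(n_scans):
        rng.shuffle(order)          # new visiting order for this scan
        for index in order:
            yield data[index]

# Two scans over a three-point dataset produce a stream of six observations.
stream = list(random_scans([3, 1, 2], n_scans=2))
```

The stream has length $n_{\text{obs.}} \times n_{\text{scans}}$, which is the quantity that drives the convergence rate quoted for online EM with averaging.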
27. Use for Batch ML Estimation: Comparison With Batch and Incremental EM
    [Figure: three panels titled Batch EM, Incremental EM and Online EM; vertical axis from −1.58 to −1.54, horizontal axis batch tours (1 to 5).]
    Figure: Normalized log-likelihood of the estimates obtained with, from top to bottom, batch EM, incremental EM and online EM as a function of the number of batch tours (or iterations, for batch EM). The data is of length N = 100 and the box-and-whisker plots summarize the results of 500 independent runs of the algorithms started from randomized starting points $\theta_0$.
28. Use for Batch ML Estimation: Comparison With Batch and Incremental EM (Contd.)
    [Figure: three panels titled Batch EM, Incremental EM and Online EM; vertical axis from −1.6 to −1.56, horizontal axis batch tours (1 to 5).]
    Figure: Same display for a data record of length N = 1,000.
29. Extensions: Summary
    The Good:
    - Easy (especially when an EM implementation is available)
    - Can be used for ML estimation from a batch of observations
    - Robust with respect to stepsize selection (note that the scale is fixed due to the use of convex combinations)
    - Handles parameter constraints nicely (only requires that $\mathcal{S}$ be closed under convex combinations with expected sufficient statistics)
30. Extensions: Summary (Contd.)
    The Bad:
    - Needs the E-step to be explicit
    - Needs $\bar\theta$ to be explicit
    - Not appropriate for short data records (say, fewer than 1,000 observations) without cycling
    - What about non-independent observations?
31. Extensions: Online EM in Latent Factor Models (Ongoing Work)
    Many models are of the form
    $$C_n \mid H_n \sim g_{\sum_{k=1}^{K} \theta_k H_{n,k}}$$
    where $\{g_\lambda\}_{\lambda \in \Lambda}$ is an exponential family of distributions and $H_n$ is a latent random vector of positive weights (probabilistic matrix factorization, discrete component analysis, partial membership models, simplicial mixtures).
    [Figure: Bayesian network representations of Latent Dirichlet Allocation (LDA).]
32. Extensions: Simulated Online EM Algorithm for LDA
    For $n = 1, \dots$
    - Simulated E-step:
      - Simulate $\tilde H_n$ given $C_n$ and $\theta_{n-1}$ (in practice, using a short run of Metropolis-Hastings or collapsed Gibbs sampling).
      - Use the Rao-Blackwellized update
        $$S_n = (1 - \gamma_n) S_{n-1} + \gamma_n E_{\theta_{n-1}}\left[ s(Z_n, W_n) \mid W_n, \tilde H_n \right]$$
    - M-step:
      $$\theta_n = \bar\theta(S_n)$$
33. Extensions
    Ignoring the sampling bias, this recursion can be analyzed and has the same asymptotic properties as the online EM algorithm.
    In particular, for well-specified models,
    $$\gamma_n^{-1/2} (\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} N\left( 0, I_f^{-1}(\theta_\star) \right)$$
    instead of
    $$\gamma_n^{-1/2} (\theta_n - \theta_\star) \xrightarrow{\mathcal{L}} N\left( 0, I_p^{-1}(\theta_\star) \right)$$
    for the "exact" online EM algorithm ($I_p(\theta_\star) = -E_{\theta_\star}\left[ \nabla_\theta^2 \log p_{\theta_\star}(X_1, Y_1) \right]$).
34. References
    - Cappé, O. & Moulines, E. (2009). On-line expectation-maximization algorithm for latent data models. J. Roy. Statist. Soc. B, 71(3):593-613.
    - Cappé, O. (2011). Online Expectation-Maximisation. To appear in Mengersen, K., Titterington, M., & Robert, C. P., eds., Mixtures, Wiley.
    - Liang, P. & Klein, D. (2009). Online EM for Unsupervised Models. In Proc. NAACL Conference.
    - Neal, R. M. & Hinton, G. E. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., ed., Learning in graphical models, pages 355-368. MIT Press, Cambridge, MA, USA.
    - Rohde, D. & Cappé, O. (2011). Online maximum-likelihood estimation for latent factor models. Submitted.
    - Sato, M. (2000). Convergence of on-line EM algorithm. In Proc. International Conference on Neural Information Processing, 1:476-481.
    - Sato, M. & Ishii, S. (2000). On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12:407-432.
    - Titterington, D. M. (1984). Recursive parameter estimation using incomplete data. J. Roy. Statist. Soc. B, 46(2):257-267.