I show how to obtain approximate maximum likelihood inference for "complex" models having some latent (unobservable) component. By "complex" I mean models having a so-called intractable likelihood, where the latter is unavailable in closed form or is too difficult to approximate. I construct a version of SAEM (an EM-type algorithm) that makes it possible to conduct inference for complex models. Traditionally SAEM is implementable only for models that are fairly tractable analytically. By introducing the concept of synthetic likelihood, where information is captured by a series of user-defined summary statistics (as in approximate Bayesian computation), it is possible to automate SAEM to run on any model having some latent component.
1. A likelihood-free version of the stochastic approximation EM algorithm (SAEM) for parameter estimation in complex models
Umberto Picchini
Centre for Mathematical Sciences,
Lund University
twitter: @uPicchini
umberto@maths.lth.se
18 October 2016, Department of Computer and Information Science,
Linköping University.
Umberto Picchini umberto@maths.lth.se, twitter:@uPicchini
2. This presentation is based on the working paper:
P. (2016). Likelihood-free stochastic approximation EM for inference
in complex models, arXiv:1609.03508.
3. I will consider:
the problem of parameter inference with “complex models”, i.e.
models having an intractable likelihood.
the inference problem for “incomplete data”, in the sense given
by the seminal EM-paper [Dempster et al. 1977].
In short, what I investigate is:
we have data Y arising from a generic model depending on the
unobservable X and the parameter θ.
How do we estimate θ from Y in the presence of the latent X?
4. The presence of the latent (unobservable) X means that we deal with
an incomplete data problem.
The EM algorithm1 is the standard way to conduct
maximum-likelihood inference for θ in the presence of incomplete data.
The complete data is the couple (Y, X), and the corresponding
complete likelihood is p(Y, X; θ).
The incomplete (data) likelihood is p(Y; θ).
We are interested in finding the MLE
ˆθ = arg max_{θ∈Θ} p(Y; θ)
given observations Y = (Y1, ..., Yn).
1 Dempster, Laird and Rubin, 1977. Maximum likelihood from incomplete data via the EM algorithm. JRSS-B.
5. In the rest of this presentation I will discuss:
SAEM: a popular stochastic version of EM, for when EM is not
directly applicable.
Implementing SAEM is difficult! And impossible for models
with intractable likelihoods. What to do?
A quick intro to Wood’s synthetic likelihoods (SL).
Our contribution: embedding SL within SAEM.
Simulation studies.
6. EM in one slide
EM is a two-step procedure: the E-step followed by the M-step.
Define
Q(θ|θ′) = ∫ log p_{Y,X}(Y, X; θ) p_{X|Y}(X|Y; θ′) dX ≡ E_{X|Y}[log p_{Y,X}(Y, X; θ)].
At iteration k ≥ 1:
E-step: compute Q(θ|ˆθ^(k−1));
M-step: obtain ˆθ^(k) = arg max_{θ∈Θ} Q(θ|ˆθ^(k−1)).
As k → ∞ the sequence {ˆθ^(k)}_k converges to a stationary point of the
data likelihood p(Y; θ) under weak assumptions.
Typically, E-step is hard while M-step is “easy”.
7. How to get around the E-step
The E-step requires the evaluation of:
Q(θ|θ′) = ∫ log p_{Y,X}(Y, X; θ) p_{X|Y}(X|Y; θ′) dX.
This is hard, as pX|Y(X|Y; ·) is typically unknown.
MCEM [Wei-Tanner 1990]
Assume we are able to simulate draws from pX|Y(X|Y; ·) say mk times
→ Monte-Carlo approximation:
generate xr ∼ pX|Y(X|Y; ·), r = 1, ..., mk;
Q(θ|θ′) ≈ (1/mk) Σ_{r=1}^{mk} log p_{Y,X}(Y, x_r; θ).
Problem: mk needs to increase as k increases. Double asymptotic
problem!
8. SAEM (stochastic approximation EM)
A more efficient approximation to the E-step is given by SAEM2:
generate xr ∼ pX|Y(X|Y; ·), r = 1, ..., mk;
˜Q(θ|ˆθ^(k)) = (1 − γk) ˜Q(θ|ˆθ^(k−1)) + γk (1/mk) Σ_{r=1}^{mk} log p_{Y,X}(Y, x_r; θ),
with {γk} a decreasing sequence such that Σ_k γk = ∞ and Σ_k γk² < ∞.
As k → ∞ it is not required for mk to increase; in fact it is possible to
take mk ≡ 1 for all k (see the next slide for convergence properties).
2 Delyon, Lavielle and Moulines, 1999. Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics.
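The stochastic approximation above can be sketched in a few lines. This is a toy illustration of my own (not the talk's implementation): the Monte Carlo term is replaced by a noisy scalar, so one can see that with a step-size sequence satisfying the two conditions, a single draw per iteration (mk ≡ 1) is enough for the recursion to settle.

```python
import random

def gamma_seq(k, k0=10):
    # hypothetical step-size schedule: gamma_k = 1 during a short burn-in,
    # then 1/(k - k0); it satisfies sum(gamma_k) = inf, sum(gamma_k^2) < inf
    return 1.0 if k <= k0 else 1.0 / (k - k0)

def saem_update(q_prev, q_mc, gamma):
    # SAEM E-step approximation: (1 - gamma)*old + gamma*(Monte Carlo term)
    return (1.0 - gamma) * q_prev + gamma * q_mc

rng = random.Random(1)
q, target = 0.0, 3.0
for k in range(1, 5001):
    noisy_draw = target + rng.gauss(0.0, 1.0)  # stands in for the m_k = 1 MC term
    q = saem_update(q, noisy_draw, gamma_seq(k))
# q ends close to the noiseless target despite using one draw per iteration
```

The averaging induced by γk = 1/(k − k0) is what removes the need for mk → ∞, in contrast with MCEM.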
9. Beautiful things happen if you manage to write log p(Y, X) as a member of
the curved exponential family, e.g.
log p(Y, X; θ) = −Λ(θ) + ⟨Sc(Y, X), Γ(θ)⟩. (1)
Here ⟨·, ·⟩ is the scalar product, Λ and Γ are two functions of θ and Sc(Y, X)
is the minimal sufficient statistic of the complete model.
Then we only need to update the sufficient statistics
sk = sk−1 + γk (Sc(Y, X^(k)) − sk−1).
Computing Sc(Y, X) for most non-trivial models is hard! But if you manage,
the M-step is often explicit:
θ^(k) = arg max_{θ∈Θ} (−Λ(θ) + ⟨sk, Γ(θ)⟩)
Only for case (1) Delyon et al. (1999) prove convergence of the sequence
{θk}k to a stationary point of p(Y; θ) under weak conditions.
10. Some considerations
General problem with all EM-type algorithms: we assumed the ability to
simulate latent states from p(X|Y). This is often not trivial.
For state-space models, plenty of possibilities given by particle filters
(sequential Monte Carlo). In this case, the sampling issue is
“solvable”.
What to do outside of state-space models? What if the model has no
dynamic structure?
What if the model is so complex that we can’t write pY,X(Y, X) in
closed form?
Example, for SDE models the transition density of the underlying Markov
process is unknown.
Then we cannot write p(X0:n) = Π_{j=1}^n p(Xj|Xj−1), hence we cannot write
p_{Y,X}(Y0:n, X0:n) = p(Y0:n|X0:n) p(X0:n).
11. If we can’t write the complete likelihood, we certainly cannot hope to
find the sufficient statistics Sc(·).
Specifically: it is impossible to apply SAEM for models having
intractable likelihoods, e.g. models for which we can’t write p(Y, X)
in closed form.
Likelihood-free methods use the ability to simulate from a model to
compensate for our ignorance about the underlying likelihood.
12. Say we formulate a statistical model p(Y; θ) such that the n observations
are assumed Yj ∼ p(Y; θ), j = 1, ..., n.
Suppose we do not know p(Y; ·), however
we do know how to implement a simulator to generate draws
from p(Y; ·).
Trivial example (but you get the idea)
y = x + ε,  x ∼ px,  ε ∼ N(0, σε²)
simulate x* ∼ px [possible even when px is unknown!]
simulate y* ∼ N(x*, σε²); then y* ∼ py(Y|σε).
Therefore, in the following we consider the case where the only thing
we know is how to forward simulate from an assumed model.
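The trivial example above really is a two-line simulator. In this sketch px is stood in by an exponential distribution purely for illustration (any black-box sampler would do); no density is ever evaluated, only draws are produced:

```python
import random

rng = random.Random(0)
sigma_eps = 0.5  # illustrative value

def simulate_y():
    # forward simulation: draw x* from p_x (here an arbitrary stand-in),
    # then y* ~ N(x*, sigma_eps^2); we never need p_x or p_y in closed form
    x_star = rng.expovariate(1.0)
    return rng.gauss(x_star, sigma_eps)

ys = [simulate_y() for _ in range(20000)]
mean_y = sum(ys) / len(ys)  # close to E[x] = 1 under this stand-in p_x
```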
13. Bayes: complex networks might not allow for trivial sampling (Gibbs-type),
i.e. when the conditional densities are unknown.
[Figure from Schadt et al. (2009), doi:10.1038/nrd2826]
14. The ability to simulate from a model even when we have no
knowledge of the analytic expressions of the underlying likelihood(s),
is central in likelihood-free methods for intractable likelihoods.
Several ways to deal with “intractable likelihoods”.
“Plug-and-play methods”: the only requirement is the ability to
simulate from the data-generating model.
particle marginal methods (PMMH, PMCMC) based on SMC
filters [Andrieu et al. 2010].
(improved) Iterated filtering [Ionides et al. 2015]
approximate Bayesian computation (ABC) [Marin et al. 2012].
Synthetic likelihoods [Wood 2010].
In the following I focus on Synthetic Likelihoods.
15. A nearly chaotic model
Two realizations from a Ricker model.
yt ∼ Poi(φNt)
Nt = r · Nt−1 · e^{−Nt−1}
Small changes in r cause major departures from the data.
Figure: one path generated with log r = 3.8 (black) and one generated with
log r = 3.799 (red); the log-likelihood is also shown as a function of log r.
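The near-chaotic sensitivity can be checked numerically on the deterministic skeleton of the Ricker map; a minimal sketch of my own (step count and starting value are arbitrary choices):

```python
import math

def ricker_skeleton(log_r, n_steps=25, n0=1.0):
    # deterministic part of the Ricker model: N_t = r * N_{t-1} * exp(-N_{t-1})
    r, n = math.exp(log_r), n0
    path = []
    for _ in range(n_steps):
        n = r * n * math.exp(-n)
        path.append(n)
    return path

p1 = ricker_skeleton(3.8)    # log r = 3.8
p2 = ricker_skeleton(3.799)  # a 0.03% change in log r
gap = max(abs(a - b) for a, b in zip(p1, p2))
# the two paths start almost identical, then separate by orders of magnitude
```

The first step differs by less than 0.02, yet within the 25 steps the paths diverge by more than the initial gap by a large factor, which is exactly why path-matching inference is fragile here.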
16. The resulting likelihood can be difficult to explore if algorithms are
badly initialized.
[Figure: the log-likelihood (in black) as a function of the parameter, in four panels: the Ricker model and three related models.]
17. A change of paradigm
from S. Wood, Nature 2010:
“Naive methods of statistical inference try to make the model
reproduce the exact course of the observed data in a way that the real
system itself would not do if repeated.”
“What is important is to identify a set of statistics that is sensitive to
the scientifically important and repeatable features of the data, but
insensitive to replicate-specific details of phase.”
In other words, with complex, stochastic and/or chaotic models we
could try to match features of the data, not the path of the data itself.
A similar approach is considered in ABC (approximate Bayesian
computation).
18. Synthetic likelihoods
y: observed data, from static or dynamic models
s(y): (vector of) summary statistics of data, e.g. mean,
autocorrelations, marginal quantiles etc.
assume
s(y) ∼ N(µθ, Σθ)
an assumption justifiable via second order Taylor expansion
(same as in Laplace approximations).
µθ and Σθ unknown: estimate them via simulations.
20. For fixed θ simulate R artificial datasets y*₁, ..., y*_R from your model and
compute the corresponding (possibly vector-valued) summaries s*₁, ..., s*_R.
compute
ˆµθ = (1/R) Σ_{r=1}^R s*_r,  ˆΣθ = (1/(R−1)) Σ_{r=1}^R (s*_r − ˆµθ)(s*_r − ˆµθ)ᵀ
compute the statistics sobs for the observed data y.
evaluate a multivariate Gaussian likelihood at sobs:
liksyn(θ) := N(sobs; ˆµθ, ˆΣθ) ∝ |ˆΣθ|^{−1/2} exp(−(sobs − ˆµθ)ᵀ ˆΣθ^{−1} (sobs − ˆµθ)/2)
This likelihood can be maximized over a varying θ or be plugged within
an MCMC algorithm targeting
ˆπ(θ|sobs) ∝ liksyn(θ) π(θ).
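The recipe above is straightforward to code. A self-contained sketch for a deliberately trivial model of my own choosing (Yj ∼ N(θ, 1); summaries are the sample mean and standard deviation; everything here is illustrative, not one of the talk's examples):

```python
import math, random

def summaries(data):
    # user-defined summary statistics: sample mean and standard deviation
    m = sum(data) / len(data)
    sd = math.sqrt(sum((d - m) ** 2 for d in data) / (len(data) - 1))
    return (m, sd)

def synthetic_loglik(s_obs, theta, R=200, n=100, seed=0):
    rng = random.Random(seed)
    # 1) simulate R artificial datasets at theta and summarize each
    sims = [summaries([rng.gauss(theta, 1.0) for _ in range(n)]) for _ in range(R)]
    # 2) sample mean and sample covariance of the simulated summaries
    mu = [sum(s[i] for s in sims) / R for i in range(2)]
    c = [[sum((s[i] - mu[i]) * (s[j] - mu[j]) for s in sims) / (R - 1)
          for j in range(2)] for i in range(2)]
    # 3) evaluate the bivariate Gaussian log-density at the observed summaries
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    inv = [[c[1][1] / det, -c[0][1] / det], [-c[1][0] / det, c[0][0] / det]]
    d = [s_obs[0] - mu[0], s_obs[1] - mu[1]]
    quad = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return -0.5 * (quad + math.log(det) + 2.0 * math.log(2.0 * math.pi))

rng = random.Random(123)
s_obs = summaries([rng.gauss(2.0, 1.0) for _ in range(100)])
ll_near = synthetic_loglik(s_obs, theta=2.0)  # at the data-generating value
ll_far = synthetic_loglik(s_obs, theta=0.0)   # far from it
```

As expected, the synthetic log-likelihood is much higher at the data-generating value than far from it, which is what makes it usable inside an optimizer or an MCMC chain.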
21. So the synthetic likelihood methodology assumes no specific
knowledge of the probabilistic features of the model.
Only assumes the ability to forward-generate from the model.
assumes that the analyst is able to specify “informative”
summaries.
assumes that said summaries are (approximately) Gaussian
s ∼ N(·).
Transforming the summaries to be approximately Gaussian is often not an issue
(just as we do in linear regression).
Of course the major issue (still open, also in ABC) is how to build
informative summaries. This is left unsolved.
22. I intend to use the synthetic likelihoods approach to enable
likelihood-free inference using SAEM.
This should allow SAEM to be applied to intractable likelihood
models.
23. We use synthetic likelihoods to construct a Gaussian approximation
over a set of complete summaries (S(Y), S(X)) to define a complete
synthetic loglikelihood.
the complete synthetic loglikelihood:
log p(s; θ) = log N(s; µ(θ), Σ(θ)),  (2)
with s = (S(Y), S(X)).
In (2) µ(θ) and Σ(θ) are unknown but can be estimated using
synthetic likelihoods (SL), conditionally on θ.
However we need to obtain a maximizer for the (incomplete)
synthetic loglikelihood log p(S(Y); θ).
24. SAEM with synthetic likelihoods (SL)
For given θ SL returns estimates ˆµ(θ) and ˆΣ(θ) (sample mean and
sample covariance).
Crucial result
For a Gaussian likelihood ˆµ(θ) and ˆΣ(θ) are sufficient statistics for
µ(θ) and Σ(θ). And a Gaussian is a member of the exponential family.
Recall: what SAEM does is to update sufficient statistics, perfect for
us!
At the kth SAEM iteration:
ˆµ^(k)(θ) = ˆµ^(k−1)(θ) + γk (ˆµ(θ) − ˆµ^(k−1)(θ))  (3)
ˆΣ^(k)(θ) = ˆΣ^(k−1)(θ) + γk (ˆΣ(θ) − ˆΣ^(k−1)(θ)).  (4)
25. Updating the latent variable X
At the kth iteration of SAEM we need to sample S(X^(k))|S(Y). This is trivial!
We have
S(X^(k))|S(Y) ∼ N(ˆµ^(k)_{x|y}(θ), ˆΣ^(k)_{x|y}(θ))
where
ˆµ^(k)_{x|y} = ˆµx + ˆΣxy ˆΣy^{−1} (S(Y) − ˆµy)
ˆΣ^(k)_{x|y} = ˆΣx − ˆΣxy ˆΣy^{−1} ˆΣyx
and ˆµx, ˆµy, ˆΣx, ˆΣy, ˆΣxy, ˆΣyx are extracted from (ˆµ^(k), ˆΣ^(k)).
That is, ˆµ^(k)(θ) = (ˆµx, ˆµy) and
ˆΣ^(k)(θ) = [ ˆΣx  ˆΣxy ; ˆΣyx  ˆΣy ].
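For scalar summaries the conditioning formulas above reduce to two lines; a sketch with made-up moments (the numbers are mine, for illustration only):

```python
def gauss_condition(mu_x, mu_y, s_xx, s_yy, s_xy, y_obs):
    # scalar case of the slide's formulas for a jointly Gaussian (X, Y):
    # mu_{x|y} = mu_x + S_xy * S_y^{-1} * (y - mu_y)
    # S_{x|y}  = S_x  - S_xy * S_y^{-1} * S_yx
    mu_cond = mu_x + (s_xy / s_yy) * (y_obs - mu_y)
    var_cond = s_xx - s_xy * s_xy / s_yy
    return mu_cond, var_cond

# unit variances, covariance 0.8, observed y = 1
mu_c, var_c = gauss_condition(0.0, 0.0, 1.0, 1.0, 0.8, 1.0)
# mu_c = 0.8, var_c = 0.36: the observation pulls the mean and shrinks the variance
```

In SAEM-SL the same algebra is applied blockwise, with the blocks extracted from the running moment estimates (ˆµ^(k), ˆΣ^(k)).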
26. The M-step
Now that we have simulated an S(X^(k)) (conditional on the data), let's
produce the complete summaries at iteration k:
s^(k) := (S(Y), S(X^(k)))
and maximize (M-step) the complete synthetic loglikelihood:
ˆθ^(k) = arg max_{θ∈Θ} log N(s^(k); µ(θ), Σ(θ)).  (5)
For each perturbation of θ the M-step performs a synthetic likelihood
simulation.
It returns the best found maximizer for (5) and corresponding best
( ˆµ, ˆΣ). Plug these in the updating moments equations (3)-(4).
27. The slide that follows describes a single iteration of SAEM-SL.
28. Input: observed summaries S(Y), positive integers L and R, values for ˆθ^(k−1), ˆµ^(k−1) and ˆΣ^(k−1).
Output: ˆθ^(k).
At iteration k:
1. Extract ˆµx, ˆµy, ˆΣx, ˆΣy, ˆΣxy and ˆΣyx from ˆµ^(k−1) and ˆΣ^(k−1). Compute the conditional moments ˆµ_{x|y}, ˆΣ_{x|y}.
2. Sample S(X^(k−1))|S(Y) ∼ N(ˆµ^(k−1)_{x|y}(θ), ˆΣ^(k−1)_{x|y}(θ)) and form s^(k−1) := (S(Y), S(X^(k−1))).
3. Obtain (θ^(k), µ^(k), Σ^(k)) from InternalSL(s^(k−1), ˆθ^(k−1), R), starting at ˆθ^(k−1).
4. Increase k := k + 1 and go to step 1.
Function InternalSL(s^(k−1), θstart, R):
Input: s^(k−1), starting parameters θstart, a positive integer R. Functions to compute simulated summaries S(y*) and S(x*) must be available.
Output: the best found θ* maximizing log N(s^(k−1); ˆµ, ˆΣ) and the corresponding (µ*, Σ*).
Here θc denotes a generic candidate value.
i. Simulate x*_r ∼ pX(X_{0:N}; θc), y*_r ∼ pY|X(Y_{1:n}|X_{1:n}; θc) for r = 1, ..., R.
ii. Compute user-defined summaries s*_r = (S(y*_r), S(x*_r)) for r = 1, ..., R. Construct the corresponding (ˆµ, ˆΣ).
iii. Evaluate log N(s^(k−1); ˆµ, ˆΣ).
Use a numerical procedure that performs (i)–(iii) L times to find the best θ* maximizing log N(s^(k−1); ˆµ, ˆΣ) over varying θc.
Denote with (ˆµ*, ˆΣ*) the simulated moments corresponding to the best found θ*. Set θ^(k) := θ*.
iv. Update moments:
ˆµ^(k) = ˆµ^(k−1) + γk (ˆµ* − ˆµ^(k−1))
ˆΣ^(k) = ˆΣ^(k−1) + γk (ˆΣ* − ˆΣ^(k−1)).
Return (θ^(k), ˆµ^(k), ˆΣ^(k)).
29. We have now completed all the steps required to implement a
likelihood-free version of SAEM.
Main inference problem: not clear how to construct a set of
informative (S(Y), S(X)) for θ. These are user-defined, hence
arbitrary.
Main computational bottleneck: compared to the regular
SAEM, our M-step is a numerical optimization routine. We used
Nelder-Mead, which is rather slow.
Ideal case (typically unattainable)
If we have:
1. s = (S(Y), S(X)) jointly sufficient for θ, and
2. s multivariate Gaussian,
then our likelihood-free SAEM converges to a stationary point of
p(Y; θ) under the conditions given in Delyon et al. 1999.
30. I have two examples to show:
a state-space model driven by an SDE: I compare SAEM-SL
with the regular SAEM and with direct optimization of the
synthetic likelihood.
a simple Gaussian state-space model: I compare SAEM-SL vs the
regular SAEM, iterated filtering and particle marginal methods.
A “static model” example is available in my paper3.
3
P. 2016. Likelihood-free stochastic approximation EM for inference in complex
models, arXiv:1609.03508.
31. Example: a nonlinear Gaussian state-space model
We study a standard toy model (e.g. Jasra et al.4).
Yj = Xj + σy νj,  j ≥ 1
Xj = 2 sin(e^{Xj−1}) + σx τj,
with νj, τj ∼ N(0, 1) i.i.d. and X0 = 0.
θ = (σx, σy).
4 Jasra, Singh, Martin and McCoy, 2012. Filtering via approximate Bayesian computation. Statistics and Computing.
32. We generate n = 50 observations from the model with
σx = σy = 2.23.
[Figure: the n = 50 simulated observations Y plotted against time.]
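For reference, the data-generating step can be sketched as follows (seed and implementation details are my own, not the talk's):

```python
import math, random

def simulate_ssm(n=50, sigma_x=2.23, sigma_y=2.23, seed=7):
    # X_j = 2*sin(exp(X_{j-1})) + sigma_x*tau_j,  Y_j = X_j + sigma_y*nu_j,  X_0 = 0
    rng = random.Random(seed)
    x, xs, ys = 0.0, [], []
    for _ in range(n):
        x = 2.0 * math.sin(math.exp(x)) + sigma_x * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(x + sigma_y * rng.gauss(0.0, 1.0))
    return xs, ys

xs, ys = simulate_ssm()
```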
33. the standard SAEM
Let’s set-up the “standard” SAEM. We need the complete likelihood
and sufficient statistics.
Easy for this model.
p(Y, X) = p(Y|X) p(X) = Π_{j=1}^n p(Yj|Xj) p(Xj|Xj−1)
Yj|Xj ∼ N(Xj, σy²)
Xj|Xj−1 ∼ N(2 sin(e^{Xj−1}), σx²)
S_{σx²} = Σ_{j=1}^n (Xj − 2 sin(e^{Xj−1}))² and S_{σy²} = Σ_{j=1}^n (Yj − Xj)²
are sufficient for σx² and σy².
34. Plug the sufficient statistics into the complete (log)likelihood, and set to
zero the gradient w.r.t. (σx², σy²).
Explicit M-step at the kth iteration:
ˆσx²^(k) = S_{σx²}/n
ˆσy²^(k) = S_{σy²}/n
To run SAEM, the only thing still needed is a way to sample X^(k)|Y.
For this we use sequential Monte Carlo, e.g. the bootstrap filter (in the
backup slides, if needed).
I skip this sampling step. Just know that this is easily accomplished
for state space models.
35. SAEM-SL: SAEM with synthetic likelihoods
To implement SAEM-SL no knowledge of the complete likelihood is
required, nor analytic derivation of the sufficient statistics.
We just have to postulate some “reasonable” summaries for X and Y.
For each synthetic likelihood step, we simulate R = 500 realizations
of S(Xr) and S(Yr), containing:
the sample median of Xr, r = 1, ..., R;
the median absolute deviation of Xr;
the 10th, 20th, 75th and 90th percentile of Xr.
the sample median of Yr;
the median absolute deviation of Yr;
the 10th, 20th, 75th and 90th percentile of Yr.
36. Results with SAEM-SL on 30 different datasets
Starting parameter values are randomly initialised. Here R = 500.
Figure: trace plots for SAEM-SL (σx, left; σy, right) for the thirty estimation
procedures. Horizontal lines are true parameter values.
37. (M, M̄) | (500, 200) | (1000, 200) | (1000, 20)
σx (true value 2.23)
SAEM-SMC | 2.54 [2.53, 2.54] | 2.55 [2.54, 2.56] | 1.99 [1.85, 2.14]
IF2 | 1.26 [1.21, 1.41] | 1.35 [1.28, 1.41] | 1.35 [1.28, 1.41]
σy (true value 2.23)
SAEM-SMC | 0.11 [0.10, 0.13] | 0.06 [0.06, 0.07] | 1.23 [1.00, 1.39]
IF2 | 1.62 [1.56, 1.75] | 1.64 [1.58, 1.67] | 1.64 [1.58, 1.67]
Table: SAEM with bootstrap filter using M particles (resampling threshold M̄); IF2 = iterated filtering.
R | 500 | 1000
σx (true value 2.23)
SAEM-SL | 1.67 [0.42, 1.97] | 1.51 [0.82, 2.03]
σy (true value 2.23)
SAEM-SL | 2.40 [2.01, 2.63] | 2.27 [1.57, 2.57]
Table: SAEM with synthetic likelihoods. K = 60 iterations.
38. Example: state-space SDE model [P., 2016]
We consider a one-dimensional state-space model driven by a SDE.
Suppose we administer 4 mg of theophylline [Dose] to a subject.
Xt is the level of theophylline concentration in blood at time t (hrs).
Consider the following state-space model:
Yj = Xj + εj,  εj ∼ i.i.d. N(0, σε²)
dXt = (Dose·Ka·Ke/Cl · e^{−Ka·t} − Ke·Xt) dt + σ√Xt dWt,  t ≥ t0
Ke is the elimination rate constant,
Ka is the absorption rate constant,
Cl is the clearance of the drug,
σ is the intensity of the intrinsic stochastic noise.
39. We simulate a set of n = 30 observations from the model at
equispaced times.
But how to simulate from this model? No analytic solution for the
SDE is available.
We resort to the Euler-Maruyama discretization with a small stepsize
h = 0.05 on the time interval [0,30]:
X_{t+h} = Xt + (Dose·Ka·Ke/Cl · e^{−Ka·t} − Ke·Xt)·h + σ√(Xt)·Z_{t+h},  {Zt} ∼ i.i.d. N(0, h)
This implies a latent simulated process of length N + 1:
X_{0:N} = {X0, Xh, ..., X_{Nh}}.
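The discretization can be sketched as below. Parameter values are placeholders (Ka and the initial state X0 in particular are my own illustrative choices, not the paper's), and the positivity guard is a pragmatic addition, not part of the scheme as stated:

```python
import math, random

def euler_maruyama(ke, ka, cl, sigma, dose=4.0, h=0.05, t_end=30.0, x0=8.0, seed=3):
    # X_{t+h} = X_t + (Dose*Ka*Ke/Cl * exp(-Ka*t) - Ke*X_t)*h + sigma*sqrt(X_t)*Z,
    # with Z ~ N(0, h) simulated as sqrt(h)*N(0, 1)
    rng = random.Random(seed)
    n_steps = round(t_end / h)
    x, path = x0, [x0]
    for i in range(n_steps):
        t = i * h
        drift = dose * ka * ke / cl * math.exp(-ka * t) - ke * x
        x = x + drift * h + sigma * math.sqrt(x * h) * rng.gauss(0.0, 1.0)
        x = max(x, 1e-8)  # crude guard: keep the state positive for sqrt()
        path.append(x)
    return path

# ke, cl, sigma taken from the slide's ground truth; ka is a hypothetical value
path = euler_maruyama(ke=0.05, ka=1.49, cl=0.04, sigma=0.1)
```

With h = 0.05 on [0, 30] this produces the 601-point latent grid X_{0:N} that the filter and the summaries operate on.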
40. A typical realization of the process:
Figure: data (circles) and the latent process (black line), time (hrs) from 0 to 30.
41. The classic SAEM
Applying the “standard” SAEM is not really trivial here.
The complete likelihood:
p(Y, X) = p(Y|X) p(X) = Π_{j=1}^n p(Yj|Xj) · Π_{i=1}^N p(Xi|Xi−1)
Yj|Xj ∼ N(Xj, σε²)
Xi|Xi−1 ∼ not available.
Euler-Maruyama induces a Gaussian approximation:
p(xi|xi−1) ≈ (1/(σ√(2π xi−1 h))) exp(−[xi − xi−1 − (Dose·Ka·Ke/Cl · e^{−Ka·τi−1} − Ke·xi−1)h]² / (2σ² xi−1 h)).
42. The classic SAEM
I am not going to show how to obtain all the sufficient summary
statistics (see the paper).
Just trust me that it requires a bit of work.
And this is just a one-dimensional model!
We sample X(k)|Y using the bootstrap filter sequential Monte Carlo
method.
If you are not familiar with sequential Monte Carlo, worry not. Just
consider it a method returning a “best” filtered X(k) based on Y (for
linear Gaussian models you would use Kalman).
43. SAEM-SL with synthetic likelihoods
User-defined summaries for a simulation r: (s(x*_r), s(y*_r)).
s(x*_r) contains:
(i) the median value of X*_{0:N};
(ii) the median absolute deviation of X*_{0:N};
(iii) a statistic for σ computed from X*_{0:N} (see next slide);
(iv) (Σ_j (Y*_j − X*_j)²/n)^{1/2}.
s(y*_r) contains:
(i) the median value of y*_r;
(ii) its median absolute deviation;
(iii) the slope of the line connecting the first and last simulated
observation, (Y*_n − Y*_1)/(tn − t1).
44. In Miao (2014): for an SDE of the type dXt = µ(Xt)dt + σg(Xt)dWt
with t ∈ [0, T], we have
Σ_Γ |X_{i+1} − X_i|² / Σ_Γ g²(X_i)(t_{i+1} − t_i) → σ²  as |Γ| → 0,
where the convergence is in probability and Γ is a partition of [0, T].
Using the discretization {X0, X1, ..., XN} produced by the Euler-Maruyama
scheme, we can take the square root of the left-hand side in the limit above,
which should be informative for σ.
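A quick numerical sanity check of this statistic (my own toy experiment, not from the talk): for the σ√Xt diffusion we have g(x) = √x, so g²(x) = x, and the denominator is Σ Xi·h. Simulating a driftless Euler path and applying the statistic recovers σ:

```python
import math, random

def sigma_hat_from_path(xs, h):
    # statistic for sigma: sqrt( sum |X_{i+1}-X_i|^2 / sum g^2(X_i)*h ),
    # with g(x) = sqrt(x) as in the sigma*sqrt(X_t)*dW_t diffusion
    num = sum((xs[i + 1] - xs[i]) ** 2 for i in range(len(xs) - 1))
    den = sum(xs[i] * h for i in range(len(xs) - 1))
    return math.sqrt(num / den)

# toy check on a driftless Euler path dX = sigma*sqrt(X) dW
rng = random.Random(0)
sigma_true, h, x = 0.3, 1e-4, 5.0
xs = [x]
for _ in range(100000):
    x = max(x + sigma_true * math.sqrt(x * h) * rng.gauss(0.0, 1.0), 1e-6)
    xs.append(x)
sigma_hat = sigma_hat_from_path(xs, h)
# sigma_hat is close to sigma_true for a fine partition
```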
45. 100 different datasets are simulated from ground-truth parameters.
All optimizations start away from ground truth values.
SAEM-SL: each M-step simulates R = 500 summaries and uses
L = 10 Nelder-Mead iterations, for K = 100 SAEM iterations.
[Figure: trace plots for Ke, Cl, σ and σε across the estimation runs.]
46. SAEM-SMC uses the bootstrap filter with M = 500 particles to
obtain X^(k)|Y.
Cl and σ are essentially unidentified.
47. Ke Cl σ σε
true values 0.050 0.040 0.100 0.319
SAEM-SMC 0.045 [0.042,0.049] 0.085 [0.078,0.094] 0.171 [0.158,0.184] 0.395 [0.329,0.465]
SAEM-SL 0.044 [0.038,0.051] 0.033 [0.028,0.039] 0.106 [0.083,0.132] 0.266 [0.209,0.307]
optim. SL 0.063 [0.054,0.069] 0.089 [0.068,0.110] 0.304 [0.249,0.370] 0.543 [0.485,0.625]
SAEM-SMC: uses M = 500 particles to filter X(k)|Y via SMC. Runs
for K = 300 SAEM iterations.
SAEM-SL at each iteration of the M-step simulates R = 500
summaries, with L = 10 Nelder-Mead iterations (M-step) and
K = 100 SAEM iterations.
“optim. SL” denotes the direct maximization of Wood’s synthetic
(incomplete) likelihood:
ˆθ = arg max_{θ∈Θ} log N(S(Y); µ(θ), Σ(θ)).  (6)
48. How about Gaussianity of the summaries?
Here we have qq-normal plots from the 7 postulated summaries at the
obtained optimum (500 simulations each).
[Figure: normal Q-Q plots of the seven summaries, four for X (sx(1)–sx(4)) and three for Y (sy(1)–sy(3)).]
The summaries' quantiles closely follow the (not shown) reference line
corresponding to a perfect match with Gaussian quantiles.
49. Summary
We introduced SAEM-SL, a version of SAEM that is able to deal
with intractable likelihoods;
It only requires the formulation and simulation of “informative”
summaries s.
How to construct informative summaries automatically is a
difficult open problem.
if said user-defined summaries s are sufficient for θ (very
unlikely) and s ∼ N(·), then SAEM-SL converges to the true
maximum likelihood estimate for p(Y; θ).
The method can be used for intractable models, or even just to
initialize starting values for more refined algorithms (e.g.
particle MCMC).
50. Key references
Andrieu et al. 2010. Particle Markov chain Monte Carlo methods.
JRSS-B.
Delyon, Lavielle and Moulines, 1999. Convergence of a stochastic
approximation version of the EM algorithm. Annals of Statistics.
Dempster, Laird and Rubin, 1977. Maximum likelihood from
incomplete data via the EM algorithm. JRSS-B.
Ionides et al. 2015. Inference for dynamic and latent variable models
via iterated, perturbed Bayes maps. PNAS.
Marin et al. 2012. Approximate Bayesian computational methods.
Stat. Comput.
Picchini 2016. Likelihood-free stochastic approximation EM for
inference in complex models, arXiv:1609.03508.
Wood 2010. Statistical inference for noisy nonlinear ecological
dynamic systems. Nature.
52. Justification of Gaussianity (Wood 2010)
Assuming Gaussianity for summaries s(·) can be justified from a
standard Taylor expansion.
Say that fθ(s) is the true (unknown) joint density of s.
Expand log fθ(s) around its mode µθ:
log fθ(s) ≈ log fθ(µθ) + (1/2)(s − µθ)ᵀ (∂² log fθ/∂s∂sᵀ)(s − µθ)
hence
fθ(s) ≈ const × exp(−(1/2)(s − µθ)ᵀ (−∂² log fθ/∂s∂sᵀ)(s − µθ))
i.e. s ∼ N(µθ, (−∂² log fθ/∂s∂sᵀ)^{−1}), approximately, when s ≈ µθ.
53. Asymptotic properties for synthetic likelihoods (Wood
2010)
As the number of simulated statistics R → ∞:
the maximizer ˆθ of liksyn(θ) is a consistent estimator;
ˆθ is an unbiased estimator;
ˆθ is not in general Gaussian; it will be Gaussian if Σθ
depends weakly on θ or when d = dim(s) is large.
54. Algorithm 1: Bootstrap filter with M particles and threshold 1 ≤ M̄ ≤ M. Resamples only when ESS < M̄.
Step 0. Set j = 1: for m = 1, ..., M sample X^(m)_1 ∼ p(X0), compute weights W^(m)_1 = f(Y1|X^(m)_1) and normalize weights w^(m)_1 := W^(m)_1 / Σ_{m=1}^M W^(m)_1.
Step 1.
if ESS({w^(m)_j}) < M̄ then
resample M particles {X^(m)_j, w^(m)_j} and set W^(m)_j = 1/M.
end if
Set j := j + 1; if j = n + 1, stop and return all constructed weights {W^(m)_j}_{m=1:M, j=1:n} to sample a single path. Otherwise go to step 2.
Step 2. For m = 1, ..., M sample X^(m)_j ∼ p(·|X^(m)_{j−1}). Compute
W^(m)_j := w^(m)_{j−1} p(Yj|X^(m)_j),
normalize the weights w^(m)_j := W^(m)_j / Σ_{m=1}^M W^(m)_j and go to step 1.
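A minimal bootstrap filter for the nonlinear toy model of slide 31 can be sketched as below. This is my own simplification of Algorithm 1: it resamples at every step instead of adaptively, so the normalized weights reset to uniform and the log-likelihood bookkeeping stays simple. The "data" used in the check are an arbitrary stand-in, not model-simulated.

```python
import math, random

def bootstrap_filter(ys, sigma_x, sigma_y, M=500, seed=0):
    # bootstrap filter: propagate particles through the transition density,
    # weight by the observation density, resample; returns the log-likelihood
    rng = random.Random(seed)
    xs = [0.0] * M  # X_0 = 0
    loglik = 0.0
    norm = 1.0 / (sigma_y * math.sqrt(2.0 * math.pi))
    for y in ys:
        # propagate: X_j | X_{j-1} ~ N(2*sin(exp(X_{j-1})), sigma_x^2)
        xs = [2.0 * math.sin(math.exp(x)) + sigma_x * rng.gauss(0.0, 1.0)
              for x in xs]
        # weight: Y_j | X_j ~ N(X_j, sigma_y^2)
        ws = [norm * math.exp(-0.5 * ((y - x) / sigma_y) ** 2) for x in xs]
        loglik += math.log(sum(ws) / M)
        xs = rng.choices(xs, weights=ws, k=M)  # multinomial resampling
    return loglik

rng = random.Random(42)
ys = [rng.gauss(0.0, 3.0) for _ in range(20)]  # stand-in observations
ll_fit = bootstrap_filter(ys, 2.23, 2.23)
ll_bad = bootstrap_filter(ys, 2.23, 25.0)      # grossly inflated obs. noise
```

Even this crude version ranks a plausible observation noise well above an absurd one, which is the behaviour SAEM-SMC relies on when drawing X^(k)|Y.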