Estimation of the score vector and observed information matrix in intractable models
Talk on derivative estimators, Bristol, 28 November 2014.

Estimation of the score vector and observed information matrix in intractable models

Arnaud Doucet (University of Oxford)
Pierre E. Jacob (University of Oxford)
Sylvain Rubenthaler (Université Nice Sophia Antipolis)

November 28th, 2014

Pierre Jacob, Derivative estimation
Outline

1 Context
2 General results and connections
3 Posterior concentration when the prior concentrates
4 Hidden Markov models
Motivation

Derivatives of the likelihood help with optimization and sampling.
For many models they are not available.
One can resort to approximation techniques.
Using derivatives in sampling algorithms

Metropolis-Adjusted Langevin Algorithm. At step $t$, given a point $\theta_t$, do: propose
$$\theta^\star \sim q(d\theta \mid \theta_t) \equiv \mathcal{N}\!\left(\theta_t + \frac{\sigma^2}{2}\nabla_\theta \log \pi(\theta_t),\ \sigma^2\right);$$
with probability
$$1 \wedge \frac{\pi(\theta^\star)\,q(\theta_t \mid \theta^\star)}{\pi(\theta_t)\,q(\theta^\star \mid \theta_t)}$$
set $\theta_{t+1} = \theta^\star$, otherwise set $\theta_{t+1} = \theta_t$.
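A minimal sketch of the MALA step described above, in Python; the standard normal target $\pi$, the starting point, and the step size $\sigma^2$ are stand-ins chosen for the example, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_pi(theta):        # log target density, up to an additive constant
    return -0.5 * theta**2

def grad_log_pi(theta):   # gradient of the log target
    return -theta

def log_q(dst, src, s2):  # log proposal density N(src + s2/2 * grad, s2), up to a constant
    mean = src + 0.5 * s2 * grad_log_pi(src)
    return -0.5 * (dst - mean)**2 / s2

def mala_step(theta, s2):
    # Langevin proposal followed by a Metropolis-Hastings accept/reject
    prop = theta + 0.5 * s2 * grad_log_pi(theta) + np.sqrt(s2) * rng.normal()
    log_ratio = (log_pi(prop) + log_q(theta, prop, s2)
                 - log_pi(theta) - log_q(prop, theta, s2))
    return prop if np.log(rng.uniform()) < log_ratio else theta

theta, s2 = 3.0, 0.5
chain = np.empty(5000)
for i in range(5000):
    theta = mala_step(theta, s2)
    chain[i] = theta
```

After burn-in, the chain's moments should match the standard normal target.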
Using derivatives in sampling algorithms

Figure : Proposal mechanism for random walk Metropolis-Hastings.
Using derivatives in sampling algorithms

Figure : Proposal mechanism for MALA.
Using derivatives in sampling algorithms

In what sense is MALA better than MH? Scaling with the dimension of the state space:
For Metropolis-Hastings, optimal scaling leads to $\sigma^2 = O(d^{-1})$.
For MALA, optimal scaling leads to $\sigma^2 = O(d^{-1/3})$.

Roberts & Rosenthal, Optimal Scaling for Various Metropolis-Hastings Algorithms, 2001.
Hidden Markov models

Figure : Graph representation of a general hidden Markov model: hidden chain $X_0, X_1, \ldots, X_T$ driven by a parameter $\theta$, with observations $y_1, \ldots, y_T$.

Hidden process: initial distribution $\mu_\theta$, transition $f_\theta$.
Observations conditional upon the hidden process, from $g_\theta$.
Assumptions

Input:
Parameter $\theta$: unknown, prior distribution $p$.
Initial condition $\mu_\theta(dx_0)$: can be sampled from.
Transition $f_\theta(dx_t \mid x_{t-1})$: can be sampled from.
Measurement $g_\theta(y_t \mid x_t)$: can be evaluated point-wise.
Observations $y_{1:T} = (y_1, \ldots, y_T)$.

Goals:
score: $\nabla_\theta \log L(\theta; y_{1:T})$ for any $\theta$,
observed information matrix: $-\nabla^2_\theta \log L(\theta; y_{1:T})$ for any $\theta$.

Note: throughout the talk, the observations, and thus the likelihood, are fixed.
Why is it an intractable model?

The likelihood function does not admit a closed-form expression:
$$L(\theta; y_1, \ldots, y_T) = \int_{\mathcal{X}^{T+1}} p(y_1, \ldots, y_T \mid x_0, \ldots, x_T, \theta)\, p(dx_0, \ldots, dx_T \mid \theta) = \int_{\mathcal{X}^{T+1}} \left\{\prod_{t=1}^T g_\theta(y_t \mid x_t)\right\} \mu_\theta(dx_0) \prod_{t=1}^T f_\theta(dx_t \mid x_{t-1}).$$
Hence the likelihood can only be estimated, e.g. by standard Monte Carlo, or by particle filters. What about the derivatives of the likelihood?
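The "standard Monte Carlo" route above can be sketched as follows: sample trajectories $x_{0:T}$ from the prior dynamics and average the product of measurement densities. The toy linear Gaussian dynamics, the observation set, and all numeric values are assumptions for the illustration, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 20, 5000

ys = rng.normal(size=T)                       # some fixed observations

def trajectory_log_weights(N, T):
    x = rng.normal(size=N)                    # x_0 ~ mu_theta (toy choice)
    log_w = np.zeros(N)
    for t in range(T):
        x = 0.9 * x + 0.3 * rng.normal(size=N)   # x_t ~ f_theta(. | x_{t-1})
        # accumulate log g_theta(y_t | x_t), here N(y; x, 0.25)
        log_w += -0.5 * np.log(2 * np.pi * 0.25) - 0.5 * (ys[t] - x)**2 / 0.25
    return log_w

log_w = trajectory_log_weights(N, T)
m = log_w.max()                               # log-sum-exp for stability
log_L_hat = m + np.log(np.mean(np.exp(log_w - m)))
print(log_L_hat)
```

The variance of this naive estimator grows quickly with $T$, which is why particle filters are preferred in practice.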
Fisher and Louis' identities

Write the score as
$$\nabla\ell(\theta) = \int \nabla \log p(x_{0:T}, y_{1:T} \mid \theta)\, p(dx_{0:T} \mid y_{1:T}, \theta),$$
which is an integral, with respect to the smoothing distribution $p(dx_{0:T} \mid y_{1:T}, \theta)$, of
$$\nabla \log p(x_{0:T}, y_{1:T} \mid \theta) = \nabla \log \mu_\theta(x_0) + \sum_{t=1}^T \nabla \log f_\theta(x_t \mid x_{t-1}) + \sum_{t=1}^T \nabla \log g_\theta(y_t \mid x_t).$$
However, pointwise evaluations of $\nabla \log \mu_\theta(x_0)$ and $\nabla \log f_\theta(x_t \mid x_{t-1})$ are not always available.
New kid on the block: Iterated Filtering

Perturbed model. Hidden states $\tilde{X}_t = (\tilde\theta_t, X_t)$:
$$\tilde\theta_0 \sim \mathcal{N}(\theta_0, \tau^2\Sigma), \quad X_0 \sim \mu_{\tilde\theta_0}(\cdot)$$
and
$$\tilde\theta_t \sim \mathcal{N}(\tilde\theta_{t-1}, \sigma^2\Sigma), \quad X_t \sim f_{\tilde\theta_t}(\cdot \mid X_{t-1} = x_{t-1}).$$
Observations $\tilde{Y}_t \sim g_{\tilde\theta_t}(\cdot \mid X_t)$.

Score estimate. Consider $V_{P,t} = \mathrm{Cov}[\tilde\theta_t \mid y_{1:t-1}]$ and $\tilde\theta_{F,t} = \mathbb{E}[\tilde\theta_t \mid y_{1:t}]$. Then
$$\sum_{t=1}^T V_{P,t}^{-1}\left(\tilde\theta_{F,t} - \tilde\theta_{F,t-1}\right) \approx \nabla\ell(\theta_0)$$
when $\tau \to 0$ and $\sigma/\tau \to 0$.

Ionides, Breto, King, PNAS, 2006.
Iterated Filtering: the mystery

Why is it valid?
Is it related to known techniques?
Can it be extended to estimate the second derivatives (i.e. the Hessian, i.e. the observed information matrix)?
How does it compare to other methods such as finite differences?
Iterated Filtering

Given a log-likelihood $\ell$ and a point $\theta_0$, consider a prior $\theta \sim \mathcal{N}(\theta_0, \sigma^2)$.

Posterior expectation when the prior variance goes to zero: first-order moments give first-order derivatives,
$$\left|\sigma^{-2}\left(\mathbb{E}[\theta \mid Y] - \theta_0\right) - \nabla\ell(\theta_0)\right| \leq C\sigma^2.$$
Phrased simply,
$$\frac{\text{posterior mean} - \text{prior mean}}{\text{prior variance}} \approx \text{score}.$$
Result from Ionides, Bhadra, Atchadé, King, Iterated filtering, 2011.
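A quick quadrature sanity check of this identity on a toy model where the score is known exactly: a Gaussian likelihood $\mathcal{N}(y; \theta, 1)$, so $\nabla\ell(\theta_0) = y - \theta_0$. The values of $y$, $\theta_0$ and $\sigma$ are assumptions for the example, not from the talk.

```python
import numpy as np

y, theta0, sigma = 1.3, 0.2, 0.05

u = np.linspace(-8, 8, 200001)            # grid for theta around theta0
theta = theta0 + u
# unnormalized log posterior: log-likelihood + Gaussian log prior
log_post = -(y - theta)**2 / 2 - (theta - theta0)**2 / (2 * sigma**2)
w = np.exp(log_post - log_post.max())     # unnormalized posterior weights
post_mean = np.sum(theta * w) / np.sum(w)

score_est = (post_mean - theta0) / sigma**2
score_true = y - theta0                    # exact score of this toy likelihood
print(score_est, score_true)               # difference should be O(sigma^2)
```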
Extension of Iterated Filtering

Posterior variance when the prior variance goes to zero: second-order moments give second-order derivatives,
$$\left|\sigma^{-4}\left(\mathrm{Cov}[\theta \mid Y] - \sigma^2\right) - \nabla^2\ell(\theta_0)\right| \leq C\sigma^2.$$
Phrased simply,
$$\frac{\text{posterior variance} - \text{prior variance}}{\text{prior variance}^2} \approx \text{Hessian}.$$
Result from Doucet, Jacob, Rubenthaler on arXiv, 2013.
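The second-order identity can be checked the same way on a toy Gaussian likelihood $\mathcal{N}(y; \theta, 1)$, whose log-likelihood Hessian is $-1$ everywhere; the numeric values are assumptions for the example.

```python
import numpy as np

y, theta0, sigma = 1.3, 0.2, 0.05

theta = np.linspace(theta0 - 2, theta0 + 2, 400001)
log_post = -(y - theta)**2 / 2 - (theta - theta0)**2 / (2 * sigma**2)
w = np.exp(log_post - log_post.max())
w /= w.sum()                               # normalized posterior weights
post_mean = np.sum(theta * w)
post_var = np.sum((theta - post_mean)**2 * w)

hess_est = (post_var - sigma**2) / sigma**4
print(hess_est)                            # close to -1 for small sigma
```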
Proximity mapping

Given a real function $f$ and a point $\theta_0$, consider for any $\sigma^2 > 0$
$$\theta \mapsto f(\theta)\exp\left\{-\frac{1}{2\sigma^2}(\theta - \theta_0)^2\right\}.$$

Figure : Example for $f: \theta \mapsto \exp(-|\theta|)$ and three values of $\sigma^2$.
Proximity mapping

Proximity mapping. The $\sigma^2$-proximity mapping is defined by
$$\mathrm{prox}_f : \theta_0 \mapsto \mathrm{argmax}_{\theta\in\mathbb{R}}\ f(\theta)\exp\left\{-\frac{1}{2\sigma^2}(\theta - \theta_0)^2\right\}.$$

Moreau approximation. The $\sigma^2$-Moreau approximation is defined by
$$f_{\sigma^2} : \theta_0 \mapsto C \sup_{\theta\in\mathbb{R}} f(\theta)\exp\left\{-\frac{1}{2\sigma^2}(\theta - \theta_0)^2\right\}$$
where $C$ is a normalizing constant.
Proximity mapping

Figure : $\theta \mapsto f(\theta)$ and $\theta \mapsto f_{\sigma^2}(\theta)$ for three values of $\sigma^2$.
Proximity mapping

Property. These objects satisfy
$$\frac{\mathrm{prox}_f(\theta_0) - \theta_0}{\sigma^2} = \nabla \log f_{\sigma^2}(\theta_0) \xrightarrow[\sigma^2 \to 0]{} \nabla \log f(\theta_0).$$
Moreau (1962), Fonctions convexes duales et points proximaux dans un espace Hilbertien.
Pereyra (2013), Proximal Markov chain Monte Carlo algorithms.
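This limit can be checked numerically with a grid argmax on a smooth example: for $f$ the standard normal density, $\nabla\log f(\theta_0) = -\theta_0$. The choice of $f$, $\theta_0$ and the grid are assumptions for the illustration.

```python
import numpy as np

theta0 = 0.8
theta = np.linspace(-5, 5, 2000001)
log_f = -theta**2 / 2                        # log of the standard normal density

for sigma2 in [0.1, 0.01]:
    # maximize log f(theta) - (theta - theta0)^2 / (2 sigma^2) over the grid
    objective = log_f - (theta - theta0)**2 / (2 * sigma2)
    prox = theta[np.argmax(objective)]
    print(sigma2, (prox - theta0) / sigma2)  # approaches -theta0 as sigma2 -> 0
```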
Proximity mapping

Bayesian interpretation. If $f$ is seen as a likelihood function, then
$$\theta \mapsto f(\theta)\exp\left\{-\frac{1}{2\sigma^2}(\theta - \theta_0)^2\right\}$$
is an unnormalized posterior density function based on a Normal prior with mean $\theta_0$ and variance $\sigma^2$. Hence
$$\frac{\mathrm{prox}_f(\theta_0) - \theta_0}{\sigma^2} \xrightarrow[\sigma^2\to 0]{} \nabla\log f(\theta_0)$$
can be read as
$$\frac{\text{posterior mode} - \text{prior mode}}{\text{prior variance}} \approx \text{score}.$$
Stein's lemma

Stein's lemma states that $\theta \sim \mathcal{N}(\theta_0, \sigma^2)$ if and only if, for any function $g$ such that $\mathbb{E}[|\nabla g(\theta)|] < \infty$,
$$\mathbb{E}[(\theta - \theta_0)\, g(\theta)] = \sigma^2\, \mathbb{E}[\nabla g(\theta)].$$
If we choose the function $g: \theta \mapsto \exp\ell(\theta)/Z$ with $Z = \mathbb{E}[\exp\ell(\theta)]$ and apply Stein's lemma, we obtain
$$\frac{1}{Z}\mathbb{E}[\theta \exp\ell(\theta)] - \theta_0 = \frac{\sigma^2}{Z}\mathbb{E}[\nabla\ell(\theta)\exp(\ell(\theta))] \iff \sigma^{-2}\left(\mathbb{E}[\theta \mid Y] - \theta_0\right) = \mathbb{E}[\nabla\ell(\theta) \mid Y].$$
Notation: $\mathbb{E}[\varphi(\theta) \mid Y] := \mathbb{E}[\varphi(\theta)\exp\ell(\theta)]/Z$.
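Stein's identity itself is easy to verify by Monte Carlo; here with the test function $g(\theta) = \theta^3$, so $\nabla g(\theta) = 3\theta^2$. The values of $\theta_0$ and $\sigma$ are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, sigma = 0.5, 0.3
theta = rng.normal(theta0, sigma, size=2_000_000)

# E[(theta - theta0) g(theta)]  vs  sigma^2 E[g'(theta)] with g(t) = t^3
lhs = np.mean((theta - theta0) * theta**3)
rhs = sigma**2 * np.mean(3 * theta**2)
print(lhs, rhs)                              # the two sides should agree
```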
Stein's lemma

For the second derivative, we consider $h: \theta \mapsto (\theta - \theta_0)\exp\ell(\theta)/Z$. Then
$$\mathbb{E}\left[(\theta - \theta_0)^2 \mid Y\right] = \sigma^2 + \sigma^4\,\mathbb{E}\left[\nabla^2\ell(\theta) + \nabla\ell(\theta)^2 \mid Y\right].$$
Adding and subtracting terms also yields
$$\sigma^{-4}\left(\mathbb{V}[\theta \mid Y] - \sigma^2\right) = \mathbb{E}\left[\nabla^2\ell(\theta) \mid Y\right] + \left\{\mathbb{E}\left[\nabla\ell(\theta)^2 \mid Y\right] - \left(\mathbb{E}[\nabla\ell(\theta) \mid Y]\right)^2\right\}$$
. . . but what we really want is $\nabla\ell(\theta_0)$, $\nabla^2\ell(\theta_0)$ and not $\mathbb{E}[\nabla\ell(\theta) \mid Y]$, $\mathbb{E}[\nabla^2\ell(\theta) \mid Y]$.
Core Idea

The prior is essentially a normal distribution $\mathcal{N}(\theta_0, \sigma^2)$, but in general has a density denoted by $\kappa$.

Posterior concentration induced by the prior. Under some assumptions, when $\sigma \to 0$:
the posterior looks more and more like the prior,
the shift in posterior moments is in $O(\sigma^2)$.

Our arXived proof suffers from an overdose of Taylor expansions.
Details

Introduce a test function $h$ such that $|h(u)| < c|u|^\alpha$ for some $c, \alpha$. We start by writing
$$\mathbb{E}\left\{h(\theta - \theta_0) \mid y\right\} = \frac{\int h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du}{\int \exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du}$$
using $u = (\theta - \theta_0)/\sigma$, and then focus on the numerator
$$\int h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du$$
since the denominator is a particular instance of this expression with $h: u \mapsto 1$.
Details

For the numerator
$$\int h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du$$
we use a Taylor expansion of $\ell$ around $\theta_0$ and a Taylor expansion of $\exp$ around $0$, and then take the integral with respect to $\kappa$.

Notation:
$$\ell^{(k)}(\theta).u^{\otimes k} = \sum_{1 \leq i_1, \ldots, i_k \leq d} \frac{\partial^k \ell(\theta)}{\partial\theta_{i_1} \cdots \partial\theta_{i_k}}\, u_{i_1} \cdots u_{i_k},$$
which in one dimension becomes
$$\ell^{(k)}(\theta).u^{\otimes k} = \frac{d^k \ell(\theta)}{d\theta^k}\, u^k.$$
Details

Main expansion:
$$\int h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du = \int h(\sigma u)\,\kappa(u)\,du + \sigma\int h(\sigma u)\,\ell^{(1)}(\theta_0).u\ \kappa(u)\,du$$
$$+ \sigma^2\int h(\sigma u)\left\{\frac{1}{2}\ell^{(2)}(\theta_0).u^{\otimes 2} + \frac{1}{2}\left(\ell^{(1)}(\theta_0).u\right)^2\right\}\kappa(u)\,du$$
$$+ \sigma^3\int h(\sigma u)\left\{\frac{1}{3!}\left(\ell^{(1)}(\theta_0).u\right)^3 + \frac{1}{2}\left(\ell^{(1)}(\theta_0).u\right)\left(\ell^{(2)}(\theta_0).u^{\otimes 2}\right) + \frac{1}{3!}\ell^{(3)}(\theta_0).u^{\otimes 3}\right\}\kappa(u)\,du + O(\sigma^{4+\alpha}).$$
In general, assumptions on the tails of the prior and the likelihood are used to control the remainder terms and to ensure they are $O(\sigma^{4+\alpha})$.
Details

We cut the integral into two pieces:
$$\int h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du = \int_{\sigma|u| \leq \rho} h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du + \int_{\sigma|u| > \rho} h(\sigma u)\exp\{\ell(\theta_0 + \sigma u) - \ell(\theta_0)\}\,\kappa(u)\,du.$$
The expansion stems from the first term, where $\sigma|u|$ is small. The second term ends up in the remainder in $O(\sigma^{4+\alpha})$ using the assumptions. This is a classic technique in Bayesian asymptotics theory.
Details

To get the score from the expansion, choose $h: u \mapsto u$. To get the observed information matrix from the expansion, choose $h: u \mapsto u^2$, and (surprisingly?) further assume that $\kappa$ is mesokurtic, i.e.
$$\int u^4 \kappa(u)\,du = 3\left(\int u^2 \kappa(u)\,du\right)^2$$
$\Rightarrow$ choose a Gaussian prior to obtain the Hessian.
Hidden Markov models

Figure : Graph representation of a general hidden Markov model.
Hidden Markov models

Direct application of the previous results:
1 Prior distribution $\mathcal{N}(\theta_0, \sigma^2)$ on the parameter $\theta$.
2 The derivative approximations involve $\mathbb{E}[\theta \mid Y]$ and $\mathrm{Cov}[\theta \mid Y]$.
3 Posterior moments for HMMs can be estimated by particle MCMC, SMC$^2$, ABC or your favourite method.

Ionides et al. proposed another approach.
Iterated Filtering

Modification of the model: $\theta$ is time-varying. The associated log-likelihood is
$$\bar\ell(\theta_{1:T}) = \log p(y_{1:T}; \theta_{1:T}) = \log \int_{\mathcal{X}^{T+1}} \left\{\prod_{t=1}^T g(y_t \mid x_t, \theta_t)\right\} \mu(dx_1 \mid \theta_1) \prod_{t=2}^T f(dx_t \mid x_{t-1}, \theta_t).$$
Introducing $\theta \mapsto (\theta, \theta, \ldots, \theta) := \theta^{[T]} \in \mathbb{R}^T$, we have $\bar\ell(\theta^{[T]}) = \ell(\theta)$ and the chain rule yields
$$\frac{d\ell(\theta)}{d\theta} = \sum_{t=1}^T \frac{\partial\bar\ell(\theta^{[T]})}{\partial\theta_t}.$$
Iterated Filtering

Choice of prior on $\theta_{1:T}$:
$$\theta_1 = \theta_0 + V_1, \quad V_1 \sim \tau^{-1}\kappa\left\{\tau^{-1}(\cdot)\right\}$$
$$\theta_{t+1} - \theta_0 = \rho\left(\theta_t - \theta_0\right) + V_{t+1}, \quad V_{t+1} \sim \sigma^{-1}\kappa\left\{\sigma^{-1}(\cdot)\right\}.$$
Choose $\sigma^2$ such that $\tau^2 = \sigma^2/(1 - \rho^2)$. Covariance of the prior on $\theta_{1:T}$:
$$\Sigma_T = \tau^2\begin{pmatrix}
1 & \rho & \cdots & \cdots & \rho^{T-1}\\
\rho & 1 & \rho & \cdots & \rho^{T-2}\\
\vdots & \ddots & \ddots & \ddots & \vdots\\
\rho^{T-2} & \cdots & \rho & 1 & \rho\\
\rho^{T-1} & \cdots & \cdots & \rho & 1
\end{pmatrix},$$
i.e. $(\Sigma_T)_{s,t} = \tau^2\rho^{|s-t|}$.
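This is the stationary AR(1) covariance; a short check, with $T$, $\rho$ and $\sigma^2$ picked arbitrarily for the example, that the matrix with entries $\tau^2\rho^{|s-t|}$ satisfies the relations implied by the recursion $\theta_{t+1} - \theta_0 = \rho(\theta_t - \theta_0) + V_{t+1}$:

```python
import numpy as np

T, rho, sigma2 = 6, 0.8, 0.5
tau2 = sigma2 / (1 - rho**2)              # stationarity relation from the slide

idx = np.arange(T)
Sigma = tau2 * rho ** np.abs(idx[:, None] - idx[None, :])

# Var(theta_{t+1}) = rho^2 Var(theta_t) + sigma^2 at stationarity
print(Sigma[1, 1], rho**2 * Sigma[0, 0] + sigma2)
# Cov(theta_{t+1}, theta_s) = rho Cov(theta_t, theta_s) for s <= t
print(Sigma[2, :2], rho * Sigma[1, :2])
```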
Iterated Filtering

Applying the general results for this prior yields, with $|x| = \sum_{t=1}^T |x_t|$:
$$\left|\nabla\bar\ell(\theta_0^{[T]}) - \Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right| \leq C\tau^2.$$
Moreover we have
$$\left|\sum_{t=1}^T \frac{\partial\bar\ell(\theta^{[T]})}{\partial\theta_t} - \sum_{t=1}^T\left\{\Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right\}_t\right| \leq \sum_{t=1}^T\left|\frac{\partial\bar\ell(\theta^{[T]})}{\partial\theta_t} - \left\{\Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right\}_t\right|$$
and
$$\frac{d\ell(\theta)}{d\theta} = \sum_{t=1}^T \frac{\partial\bar\ell(\theta^{[T]})}{\partial\theta_t}.$$
Iterated Filtering

The estimator of the score is thus given by
$$\sum_{t=1}^T\left\{\Sigma_T^{-1}\left(\mathbb{E}\left[\theta_{1:T} \mid Y\right] - \theta_0^{[T]}\right)\right\}_t,$$
which, given the form of $\Sigma_T^{-1}$, can be reduced to
$$S_{\tau,\rho,T}(\theta_0) = \frac{\tau^{-2}}{1+\rho}\left[(1-\rho)\sum_{t=2}^{T-1}\mathbb{E}\left(\theta_t \mid Y\right) - \left\{(1-\rho)T + 2\rho\right\}\theta_0 + \mathbb{E}\left(\theta_1 \mid Y\right) + \mathbb{E}\left(\theta_T \mid Y\right)\right].$$
Note that in the quantities $\mathbb{E}(\theta_t \mid Y)$, $Y = Y_{1:T}$ is the complete dataset, thus those expectations are with respect to the smoothing distribution.
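The reduction above is a row-sum identity for $\Sigma_T^{-1}$; it can be checked numerically by comparing the direct expression with the closed form for an arbitrary vector $m$ playing the role of the smoothing means $\mathbb{E}[\theta_{1:T} \mid Y]$ (all numeric values are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
T, rho, tau2, theta0 = 8, 0.6, 0.3, 0.4

idx = np.arange(T)
Sigma = tau2 * rho ** np.abs(idx[:, None] - idx[None, :])
m = rng.normal(size=T)                    # stand-in for the smoothing means

# direct: sum_t { Sigma_T^{-1} (m - theta0 * 1) }_t
direct = np.sum(np.linalg.solve(Sigma, m - theta0))
# closed form S_{tau,rho,T}(theta0) from the slide
closed = (1 / tau2) / (1 + rho) * (
    (1 - rho) * m[1:T-1].sum()
    - ((1 - rho) * T + 2 * rho) * theta0
    + m[0] + m[-1]
)
print(direct, closed)                     # the two should agree
```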
Iterated Filtering

If $\rho = 1$, then the parameters follow a random walk:
$$\theta_1 = \theta_0 + \mathcal{N}(0, \tau^2) \quad\text{and}\quad \theta_{t+1} = \theta_t + \mathcal{N}(0, \sigma^2).$$
In this case Ionides et al. proposed the estimator
$$S_{\tau,\sigma,T} = \tau^{-2}\left(\mathbb{E}\left(\theta_T \mid Y\right) - \theta_0\right)$$
as well as
$$S^{(bis)}_{\tau,\sigma,T} = \sum_{t=1}^T V_{P,t}^{-1}\left(\tilde\theta_{F,t} - \tilde\theta_{F,t-1}\right)$$
with $V_{P,t} = \mathrm{Cov}[\tilde\theta_t \mid y_{1:t-1}]$ and $\tilde\theta_{F,t} = \mathbb{E}[\theta_t \mid y_{1:t}]$. Those expressions only involve expectations with respect to filtering distributions.
Iterated Filtering

If $\rho = 0$, then the parameters are i.i.d.:
$$\theta_1 = \theta_0 + \mathcal{N}(0, \tau^2) \quad\text{and}\quad \theta_{t+1} = \theta_0 + \mathcal{N}(0, \tau^2).$$
In this case the expression of the score estimator reduces to
$$S_{\tau,T} = \tau^{-2}\sum_{t=1}^T\left(\mathbb{E}\left(\theta_t \mid Y\right) - \theta_0\right),$$
which involves smoothing distributions. There is only one parameter $\tau^2$ to choose for the prior. However, smoothing for general hidden Markov models is difficult, and typically resorts to "fixed-lag approximations".
Iterated Smoothing

Only for the case $\rho = 0$ are we able to obtain simple expressions for the observed information matrix. We propose the following estimator:
$$I_{\tau,T}(\theta_0) = -\tau^{-4}\left\{\sum_{s=1}^T\sum_{t=1}^T \mathrm{Cov}\left(\theta_s, \theta_t \mid Y\right) - \tau^2 T\right\},$$
for which we can show that
$$\left|I_{\tau,T} - \left(-\nabla^2\ell(\theta_0)\right)\right| \leq C\tau^2.$$
Numerical results

Linear Gaussian state space model where the ground truth is available through the Kalman filter:
$$X_0 \sim \mathcal{N}(0, 1), \quad X_t = \rho X_{t-1} + \mathcal{N}(0, V), \quad Y_t = \eta X_t + \mathcal{N}(0, W).$$
Generate $T = 100$ observations and set $\rho = 0.9$, $V = 0.7$, $\eta = 0.9$ and $W = 0.1, 0.2, 0.4, 0.9$. 240 independent runs, matching the computational costs between methods in terms of number of calls to the transition kernel.
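A minimal sketch of this setup: simulate the stated model (one of the four settings for $W$) and evaluate its exact log-likelihood with a scalar Kalman filter, which is how the "ground truth" derivatives can then be obtained, e.g. by finite differences; the seed and the implementation details are mine, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
T, rho, V, eta, W = 100, 0.9, 0.7, 0.9, 0.1   # values stated on the slide

x = rng.normal()                               # X_0 ~ N(0, 1)
ys = np.empty(T)
for t in range(T):
    x = rho * x + np.sqrt(V) * rng.normal()        # transition
    ys[t] = eta * x + np.sqrt(W) * rng.normal()    # measurement

def kalman_loglik(ys, rho, V, eta, W):
    m, P = 0.0, 1.0                            # moments of X_0
    ll = 0.0
    for y in ys:
        m, P = rho * m, rho**2 * P + V         # predict X_t given y_{1:t-1}
        S = eta**2 * P + W                     # predictive variance of Y_t
        ll += -0.5 * (np.log(2 * np.pi * S) + (y - eta * m)**2 / S)
        K = eta * P / S                        # Kalman gain
        m, P = m + K * (y - eta * m), (1 - eta * K) * P
    return ll

print(kalman_loglik(ys, rho, V, eta, W))
```

Evaluating `kalman_loglik` on a grid of parameter values (or differencing it) gives the exact score and observed information against which the estimators are compared.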
Numerical results

Figure : 240 runs for Iterated Smoothing and Finite Difference. Panels: Finite Difference (RMSE against h) and Iterated Smoothing (RMSE against tau), for parameters 1 to 4.
Numerical results

Figure : 240 runs for Iterated Smoothing and Iterated Filtering. Panels: Iterated Smoothing, Iterated Filtering 1 and Iterated Filtering 2 (RMSE against tau), for parameters 1 to 4.
Bibliography

Main references:
Inference for nonlinear dynamical systems, Ionides, Breto, King, PNAS, 2006.
Iterated filtering, Ionides, Bhadra, Atchadé, King, Annals of Statistics, 2011.
Efficient iterated filtering, Lindström, Ionides, Frydendall, Madsen, 16th IFAC Symposium on System Identification.
Derivative-Free Estimation of the Score Vector and Observed Information Matrix, Doucet, Jacob, Rubenthaler, 2013 (on arXiv).
