CLIM Transition Workshop - Semiparametric Models for Extremes - Surya Tokdar, May 16, 2018
1. 1/28
Semiparametric Models for Analyzing Extremes
Surya T Tokdar and Erika Cunningham
Duke University
Thanks to: Whitney Huang, Michael Stein, Michael Wehner and others in the Extremes Semiparametric Subgroup
4. 4/28
Motivation behind POT
The main motivation is to obtain low-dimensional parametric
estimation that focuses primarily on the tail decay rate, which
may be quantified by the tail index ξ [1]
By the Pickands-Balkema-de Haan theorem, the truncated CDF
above a large threshold is well approximated by a generalized
Pareto distribution with matching tail index ξ
[1] For polynomially decaying tails, 1 − F(y) decays like y^{−1/ξ}, so
$\xi^{-1} = \lim_{y \to \infty} \frac{-\log(1 - F(y))}{\log y}$
5. 5/28
Issues with POT
Setting the correct/optimal threshold is extremely challenging
POT is difficult to extend as a model for more complex data
with spatio-temporal dependence or other structures
Our goal: develop a semiparametric model for heavy tailed
data where the tails are estimated under parametric
assumptions whereas the center is estimated
nonparametrically!
7. 7/28
Transformation
Setting.
Data range = (a, b), with a = −∞ and/or b = ∞
{gθ : θ ∈ Θ} a parametric family of pdfs on (a, b)
Gθ denotes the CDF of gθ
Lemma
For any pdf f on (a, b) and any θ ∈ Θ there exists a unique pdf
h = hθ,f on (0, 1) such that
f (y) = gθ(y)h(Gθ(y)), y ∈ (a, b).
Proof. Take Y ∼ f and take h to be the pdf of U = Gθ(Y )
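The proof is constructive: h(u) = f(Gθ^{-1}(u)) / gθ(Gθ^{-1}(u)). A minimal numerical sketch, assuming an illustrative normal-mixture f and gθ = t3(0, 1) (both choices are hypothetical, not from the slides):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical choices for illustration: g is t_3(0, 1), f is a two-component normal mixture.
g = stats.t(df=3)

def f_pdf(y):
    return (0.5 * stats.norm.pdf(y, loc=-1.0, scale=0.7)
            + 0.5 * stats.norm.pdf(y, loc=1.5, scale=1.0))

def h_pdf(u):
    """Density of U = G(Y) for Y ~ f: h(u) = f(G^{-1}(u)) / g(G^{-1}(u))."""
    y = g.ppf(u)
    return f_pdf(y) / g.pdf(y)

# h should integrate to 1 over (0, 1) ...
h_mass, _ = quad(h_pdf, 0.0, 1.0)

# ... and f(y) = g(y) * h(G(y)) should hold pointwise.
y_grid = np.linspace(-4.0, 4.0, 9)
f_rebuilt = g.pdf(y_grid) * h_pdf(g.cdf(y_grid))
max_err = np.max(np.abs(f_rebuilt - f_pdf(y_grid)))
```

Here `h_mass` should be close to 1 and `max_err` close to 0, up to quadrature and floating-point error.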
8. 8/28
Tail matching
Suppose gθ and f are continuous densities
Then h = hθ,f is continuous and the two limits
$\lim_{y \downarrow a} \frac{f(y)}{g_\theta(y)} = \lim_{u \downarrow 0} h(u) =: h(0)$,
$\lim_{y \uparrow b} \frac{f(y)}{g_\theta(y)} = \lim_{u \uparrow 1} h(u) =: h(1)$
exist but could equal 0 or ∞.
Corollary
f and gθ have same right and/or left tail index if and only if
0 < h(1) < ∞ and/or 0 < h(0) < ∞
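A small numerical illustration of the corollary, as a sketch under assumed distributions: a scaled t3 against gθ = t3(0, 1) (same tail index, so h(u) approaches a finite positive limit as u → 1) and a t3 against a lighter-tailed t5 (different tail index, so h(u) grows without bound). The specific distributions are hypothetical examples.

```python
import numpy as np
from scipy import stats

def h_at(u, f, g):
    """h(u) = f(G^{-1}(u)) / g(G^{-1}(u)) for frozen scipy distributions f and g."""
    y = g.ppf(u)
    return f.pdf(y) / g.pdf(y)

u_near_one = 1.0 - np.array([1e-4, 1e-6, 1e-8])

# Matching tail index: f = t_3 with scale 2 against g = t_3(0, 1); the ratio tends to 2**3 = 8.
h_match = h_at(u_near_one, stats.t(df=3, scale=2.0), stats.t(df=3))

# Mismatched tail index: f = t_3 against the lighter-tailed g = t_5; the ratio keeps growing.
h_mismatch = h_at(u_near_one, stats.t(df=3), stats.t(df=5))
```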
9. 9/28
Tail-identified transformation
Definition
The family {gθ : θ ∈ Θ} is tail-identified if θ ≠ θ′ implies gθ and
gθ′ have distinct right and/or left tail indices.
Lemma
If {gθ : θ ∈ Θ} is tail-identified then for any pdf f on (a, b) there is
at most one θf ∈ Θ with h = hθf,f satisfying 0 < h(0), h(1) < ∞.
10. 10/28
Semiparametric density model for bulk + tail
{gθ : θ ∈ Θ} a tail-identified family
H := {h(·) a continuous pdf on [0, 1] : 0 < h(0), h(1) < ∞}
F := {f (·) = gθ(·)h(Gθ(·)) : θ ∈ Θ, h ∈ H}
Model: Y1, Y2, . . . ∼ f (IID), f ∈ F
12. 12/28
Logistic GP prior on H
Definition (The logistic transform)
L : C([0, 1]) → H given by
$(Lw)(u) = \frac{e^{w(u)}}{\int_0^1 e^{w(t)}\,dt}, \quad u \in [0, 1].$
Definition (The logistic GP)
LGP(µ, σ) = L∗GP(µ, σ)
I.e., h ∼ LGP(µ, σ) ⇐⇒ h = Lw with w ∼ GP(µ, σ)
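The logistic transform can be approximated on a grid. The sketch below uses a trapezoid rule for the normalizing integral and a fixed, hypothetical w; under the LGP prior w would instead be a Gaussian process draw.

```python
import numpy as np

def trapezoid(vals, x):
    """Trapezoid-rule integral of vals over the grid x."""
    return np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(x))

def logistic_transform(w, u):
    """Approximate (Lw)(u) = exp(w(u)) / int_0^1 exp(w(t)) dt on the grid u."""
    e = np.exp(w - np.max(w))   # subtracting the max is for stability; it cancels in the ratio
    return e / trapezoid(e, u)

u = np.linspace(0.0, 1.0, 201)
w = np.sin(2.0 * np.pi * u)     # hypothetical fixed function standing in for a GP draw
h = logistic_transform(w, u)
h_mass = trapezoid(h, u)
```

Since L is invariant to adding a constant to w, `logistic_transform(w + c, u)` returns the same h for any constant c.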
13. 13/28
sLGP heavy-tailed density estimation on R
Model: Y1, . . . , Yn ∼ f (·) = gθ(·)h(Gθ(·))
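$h \sim \mathrm{LGP}(0, \kappa^2 C^{SE}_\lambda)$, $(\kappa^2, \lambda^2) \sim \mathrm{Ga}^{-1} \times \mathrm{Ga}^{-1}$
$g_\theta = t_\nu(\mu, \tau^2)$, $\theta = (\mu, \tau^2, \nu) \sim \frac{1}{\tau^2} \times \pi_\nu(\nu)$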
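A rough sketch of one prior draw of f with κ², λ and θ held fixed. The squared-exponential parameterization, the grid sizes, the jitter, and the fixed values κ² = 1, λ = 0.3, gθ = t3(0, 1) are all assumptions made for illustration; the hyperpriors are not sampled here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def se_cov(u, kappa2, lam):
    """kappa2 * exp(-(s - t)^2 / (2 lam^2)); this parameterization is an assumption."""
    d = u[:, None] - u[None, :]
    return kappa2 * np.exp(-0.5 * (d / lam) ** 2)

def trapezoid(vals, x):
    return np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(x))

u = np.linspace(0.0, 1.0, 101)
K = se_cov(u, kappa2=1.0, lam=0.3) + 1e-6 * np.eye(u.size)  # jitter keeps the Cholesky stable
w = np.linalg.cholesky(K) @ rng.standard_normal(u.size)     # w ~ GP(0, K) on the grid
h = np.exp(w) / trapezoid(np.exp(w), u)                      # logistic transform of w

g = stats.t(df=3)                                            # g_theta = t_3(0, 1), for illustration
y = np.linspace(-60.0, 60.0, 24001)
f = g.pdf(y) * np.interp(g.cdf(y), u, h)                     # f(y) = g(y) h(G(y))
f_mass = trapezoid(f, y)                                     # should be close to 1
```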
14. 14/28
gθ = t3(0, 1), λ = 0.3 (top) and 0.08 (bottom)
[Figure: for each λ, three panels showing w(u) vs u, h(u) vs u, and f(y) vs y]
15. 15/28
Model fitting
Low-rank approximation to GP w
Discretization of length-scale λ over a dense grid
Precomputed covariance matrix + Cholesky factors
Adaptive Metropolis MCMC
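The slides do not say which low-rank construction is used. As one possible sketch, the code below assumes a knot-based (predictive-process style) approximation and precomputes, for each length-scale on a grid, the matrix that maps standard normal coefficients to w. The knot locations, the grid of λ values, the jitter, and the kernel parameterization are all hypothetical; the adaptive Metropolis sampler itself is not shown.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def se_cov(s, t, lam):
    """Squared-exponential covariance exp(-(s - t)^2 / (2 lam^2)); parameterization assumed."""
    return np.exp(-0.5 * ((s[:, None] - t[None, :]) / lam) ** 2)

knots = np.linspace(0.0, 1.0, 11)        # hypothetical knots (rank-11 approximation)
u = np.linspace(0.0, 1.0, 201)           # evaluation grid on [0, 1]
lam_grid = np.geomspace(0.05, 1.0, 20)   # hypothetical discretization of the length-scale

# For each lam, precompute A with A @ A.T = K_uk K_kk^{-1} K_ku, so that
# w = A @ z with z ~ N(0, I) is a low-rank GP draw.
basis = []
for lam in lam_grid:
    K_kk = se_cov(knots, knots, lam) + 1e-8 * np.eye(knots.size)  # jitter for stability
    L = cholesky(K_kk, lower=True)
    A = solve_triangular(L, se_cov(knots, u, lam), lower=True).T  # shape (len(u), len(knots))
    basis.append(A)

rng = np.random.default_rng(1)
z = rng.standard_normal(knots.size)
w = basis[5] @ z                         # draw for lam_grid[5] without any new factorization
```

With the factors stored up front, a sampler that proposes a new λ on the grid only needs a matrix-vector product to update w.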
18. 18/28
Simulation
Simulation setup
Mixture of a standard normal and a centered t4
100 data sets, n=2000
[Figure "Simulation 2": density of the mixture of a centered normal and a t with 4 df; x-axis x from −10 to 10, y-axis Density]
Check
1. Parameters of gθ, specifically, tail index ν^{−1}
2. Estimation of high (and low) quantiles
19. 19/28
Results
LGP Tail Index Estimation
Parameter   True   Mean estimate   Coverage of 95% CI
ν           4      3.76            88%
ξ           0.25   0.27            88%
Comparison to Generalized Pareto Distribution (GPD)
Fit GPD to absolute values above Q0.975; expect about 100 exceedances (n = 2000)
Fit using maximum likelihood
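A sketch of the GPD comparison on one simulated data set. The mixture weights (0.5 each) are an assumption, since the slides do not state them, and the threshold is taken as the empirical 0.95 quantile of |Y| so that 100 of the 2000 points exceed it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2018)
n = 2000

# Hypothetical mixture: equal weights on N(0, 1) and a centered t_4 (weights are assumed).
from_t = rng.random(n) < 0.5
t_draws = stats.t.rvs(df=4, size=n, random_state=rng)
y = np.where(from_t, t_draws, rng.standard_normal(n))

# Threshold on |Y| chosen so that 5% of the points (100 of 2000) exceed it.
abs_y = np.abs(y)
tail_frac = 0.05
threshold = np.quantile(abs_y, 1.0 - tail_frac)
excess = abs_y[abs_y > threshold] - threshold

# Maximum-likelihood GPD fit to the excesses, with the location fixed at 0.
xi_hat, _, scale_hat = stats.genpareto.fit(excess, floc=0.0)

# Extrapolated quantile of |Y| at level p from the fitted GPD.
p = 0.999
q_hat = threshold + stats.genpareto.ppf(1.0 - (1.0 - p) / tail_frac,
                                        xi_hat, loc=0.0, scale=scale_hat)
```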
20. 20/28
Extreme quantiles
Upper-tail Bias and Ratio of RMSE
[Figure: two panels over p = 0.99, 0.999, 0.9999, 0.99999. First panel: quantile bias for the GPD and LGP estimators. Second panel: quantile RMSE ratio (GPD/LGP)]
GPD has lower quantile RMSE just beyond threshold (p=0.98, 0.99)
Otherwise, LGP has lower RMSE in extrapolated tails
21. 21/28
Transition from parametric to non-parametric
$F^{-1}(p) \approx G_\theta^{-1}\left(\frac{p}{h(0)}\right)$, and $F^{-1}(1 - p) \approx G_\theta^{-1}\left(1 - \frac{p}{h(1)}\right)$, $p \approx 0$
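The upper-tail approximation follows from 1 − F(y) ≈ h(1)(1 − Gθ(y)) for large y. A quick numerical check, assuming a hypothetical pair where h(1) is known in closed form: f is a t3 with scale s and gθ = t3(0, 1), for which h(1) = s³.

```python
import numpy as np
from scipy import stats

nu, s = 3, 2.0
g = stats.t(df=nu)               # g_theta = t_3(0, 1)
f = stats.t(df=nu, scale=s)      # hypothetical true f: same tail index, different scale
h1 = s ** nu                     # h(1) = lim f(y) / g(y) as y -> infinity, equal to s^nu here

p = np.array([1e-2, 1e-3, 1e-4])
exact = f.isf(p)                 # F^{-1}(1 - p)
approx = g.isf(p / h1)           # G^{-1}(1 - p / h(1))
rel_err = np.abs(approx / exact - 1.0)
```

The relative error shrinks as p → 0, consistent with the approximation being a tail statement.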
[Figure: Dataset 1, lower quantiles (p from 0.001 to 0.1) and upper quantiles (p from 0.9 to 0.999). Each panel plots Quantile vs p with the LGP CI, the parametric CI, and the true quantile]
22. 22/28
Conclusions and ongoing work
Summary
sLGP approach provides a promising alternative to POT
Despite bias, sLGP reduces variance sufficiently in quantile
estimation to provide lower RMSE than GPD
Ongoing work
Change gθ for one with different indices in each tail
Get theory results for estimation of θ
Extend sLGP to multivariate, time-series, spatio-temporal,
regression etc.
Application to wind speed vs direction analysis
25. 25/28
Asymmetric t: approach 2
Could use gα,β(y) = g(y)hα,β(G(y)) with
1. g the tν0 pdf, with CDF G
2. hα,β is the Be(α, β) pdf
Tail-index:
Left: (αν0)^{−1}
Right: (βν0)^{−1}
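The stated tail indices can be checked numerically: with U ∼ Be(α, β) and Y = G^{-1}(U), the log-log slope of each tail probability should approach −αν0 on the left and −βν0 on the right. The parameter values below are hypothetical.

```python
import numpy as np
from scipy import stats

nu0, a, b = 3.0, 0.5, 2.0        # hypothetical nu0, alpha, beta
g = stats.t(df=nu0)

def right_survival(y):
    """P(Y > y) for Y = G^{-1}(U), U ~ Be(a, b), written to avoid cancellation near u = 1."""
    # P(U > G(y)) = P(1 - U < 1 - G(y)), and 1 - U ~ Be(b, a).
    return stats.beta.cdf(g.sf(y), b, a)

def left_cdf(y):
    """P(Y <= y) = P(U <= G(y))."""
    return stats.beta.cdf(g.cdf(y), a, b)

# Log-log slopes between two large |y| values estimate -1 / (tail index).
y1, y2 = 1e3, 1e4
log_gap = np.log(y2) - np.log(y1)
right_slope = (np.log(right_survival(y2)) - np.log(right_survival(y1))) / log_gap
left_slope = (np.log(left_cdf(-y2)) - np.log(left_cdf(-y1))) / log_gap
```

Here `right_slope` should be near −βν0 = −6 and `left_slope` near −αν0 = −1.5.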
26. 26/28
Theory for sLGP
Theorem
If the true f* matches the model, f*(y) = gθ*(y) h*(Gθ*(y)), with
1. log h* ∈ C^α([0, 1]) and
2. {gθ : θ ∈ Θ} is regular at θ* (e.g., Cramér conditions)
then
$\operatorname{plim}_{n \to \infty} \Pi\left( \|f - f^*\|_1 \ge M_n\, n^{-\alpha/(2\alpha+1)} (\log n)^q \,\middle|\, Y_1, \ldots, Y_n \right) = 0$
for any Mn → ∞, with q = (4α + 1)/(4α + 2)
What about estimation of θ?
Is uncertainty suppressed for extreme quantiles?
27. 27/28
Extension to dependent data
Primary idea is to use copula, which offers a conceptually
simple and sound extension and keeps model fitting relatively
simple.
There are concerns about appropriate choice of copulas –
particularly in terms of what tail dependence models are
derived from them
28. 28/28
Wind analysis
Use polar transformation to represent data as wind direction and
wind speed, and model the latter as possibly heavy-tailed data
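A minimal sketch of the polar transformation, assuming the raw data are horizontal wind components (u, v); the slides do not specify the input format. The angle convention used here is the mathematical one, which may differ from meteorological conventions.

```python
import numpy as np

def to_polar(u, v):
    """Convert wind components (u, v) to speed and direction in radians in [0, 2*pi)."""
    speed = np.hypot(u, v)
    # Counter-clockwise angle from the +u axis; meteorological conventions differ.
    direction = np.mod(np.arctan2(v, u), 2.0 * np.pi)
    return speed, direction

u = np.array([3.0, 0.0, -4.0])   # hypothetical component values
v = np.array([4.0, -2.0, 0.0])
speed, direction = to_polar(u, v)
```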