CLIM Transition Workshop - Semiparametric Models for Extremes - Surya Tokdar, May 16, 2018

1/28
Semiparametric Models for Analyzing Extremes
Surya T Tokdar and Erika Cunningham
Duke University
Thanks to: Whitney Huang, Michael Stein, Michael Wehner and others in the Extremes Semiparametric Subgroup

2/28
Thresholding for Extreme Analysis

3/28
Analyzing extremes
How can we predict a 1000-year flood from limited data?
[Figure: scatter plot of daily rainfall for non-dry days (mm, 0 to 80) against observation index (0 to 8000)]
Common practice: model only the values over a threshold ("Peaks Over
Thresholds", or "POT"), throwing away the bulk of the data to improve
tail estimation

4/28
Motivation behind POT
The main motivation is to obtain a low-dimensional parametric
estimate that focuses primarily on the tail decay rate, which
may be quantified by the tail index ξ [1]
By the Pickands-Balkema-de Haan theorem, the truncated CDF
above a large threshold is well approximated by a generalized
Pareto distribution with matching tail index ξ
[1] For polynomially decaying tails, $\xi^{-1} = \lim_{y \to \infty} \frac{-\log(1 - F(y))}{\log y}$
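
For reference, the generalized Pareto approximation behind POT can be written out as below. This is the standard textbook form rather than something shown on the slide; the threshold u and the scale σ_u are symbols introduced here for illustration, and the display assumes ξ > 0 (the heavy-tailed case).

```latex
\[
  \Pr(Y - u \le x \mid Y > u) \;\approx\; 1 - \left(1 + \frac{\xi x}{\sigma_u}\right)^{-1/\xi},
  \qquad x \ge 0,\ \xi > 0 .
\]
```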

5/28
Issues with POT
Setting the correct/optimal threshold is extremely challenging
POT is difficult to extend to more complex data
with spatio-temporal dependence or other structure
Our goal: develop a semiparametric model for heavy-tailed
data in which the tails are estimated under parametric
assumptions while the center is estimated
nonparametrically!

6/28
Transformation to separate tail from the bulk

7/28
Transformation
Setting.
Data range = (a, b), with a = −∞ and/or b = ∞
{gθ : θ ∈ Θ} a parametric family of pdfs on (a, b)
Gθ denotes the CDF of gθ
Lemma
For any pdf f on (a, b) and any θ ∈ Θ there exists a unique pdf
h = hθ,f on (0, 1) such that
f(y) = gθ(y) h(Gθ(y)), y ∈ (a, b).
Proof. Take Y ∼ f and take h to be the pdf of U = Gθ(Y).
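
The lemma can be checked numerically with a small sketch. The choices below (f a standard normal, gθ a t density with 3 degrees of freedom) are assumptions made only for illustration; the point is that h(u) = f(y)/gθ(y) at y = Gθ⁻¹(u) integrates to 1 and reproduces f.

```python
import numpy as np
from scipy import integrate, stats

# Assumed example choices (not from the slides):
# f is a standard normal pdf, g_theta is a t pdf with 3 degrees of freedom.
f = stats.norm()
g = stats.t(df=3)

def h(u):
    """Density of U = G(Y) for Y ~ f, i.e. h(u) = f(y) / g(y) at y = G^{-1}(u)."""
    y = g.ppf(u)
    # Work on the log scale so the ratio stays stable far out in the tails.
    return np.exp(f.logpdf(y) - g.logpdf(y))

# h should integrate to 1 on (0, 1) ...
total, _ = integrate.quad(h, 0.0, 1.0)

# ... and f(y) = g(y) * h(G(y)) should hold pointwise.
y_grid = np.linspace(-3.0, 3.0, 7)
recon = g.pdf(y_grid) * h(g.cdf(y_grid))
max_err = float(np.max(np.abs(recon - f.pdf(y_grid))))

print(f"integral of h = {total:.6f}, max reconstruction error = {max_err:.2e}")
```

Here h vanishes at both ends of (0, 1), reflecting that the normal has lighter tails than the t reference.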

8/28
Tail matching
Suppose gθ and f are continuous densities
Then h = hθ,f is continuous and the two limits
$\lim_{y \downarrow a} \frac{f(y)}{g_\theta(y)} = \lim_{u \downarrow 0} h(u) =: h(0)$,
$\lim_{y \uparrow b} \frac{f(y)}{g_\theta(y)} = \lim_{u \uparrow 1} h(u) =: h(1)$
exist but could equal 0 or ∞.
Corollary
f and gθ have same right and/or left tail index if and only if
0 < h(1) < ∞ and/or 0 < h(0) < ∞
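
A quick numerical illustration of tail matching, as a sketch under assumed example densities: with gθ = t4, a rescaled t4 has the same tail index, so f/gθ settles at a finite positive value (2⁴ = 16 for scale 2), whereas a t3 has a heavier tail and the ratio grows without bound.

```python
import numpy as np
from scipy import stats

g = stats.t(df=4)                 # reference density g_theta (assumed example)
f_same = stats.t(df=4, scale=2)   # same tail index as g, different scale
f_heavier = stats.t(df=3)         # heavier right tail than g

def log_ratio(f, y):
    """log of f(y) / g(y), computed on the log scale for stability."""
    return f.logpdf(y) - g.logpdf(y)

ys = np.array([1e2, 1e3, 1e4])
ratio_same = np.exp(log_ratio(f_same, ys))        # levels off: 0 < h(1) < infinity
ratio_heavier = np.exp(log_ratio(f_heavier, ys))  # keeps growing: h(1) = infinity

print("same tail index:", ratio_same)
print("heavier tail:   ", ratio_heavier)
```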

9/28
Tail-identified transformation
Definition
The family {gθ : θ ∈ Θ} is tail-identified if θ ≠ θ′ implies gθ and
gθ′ have distinct right and/or left tail indices.
Lemma
If {gθ : θ ∈ Θ} is tail-identified then for any pdf f on (a, b) there is
at most one θf ∈ Θ with h = h_{θf, f} satisfying 0 < h(0), h(1) < ∞.

10/28
Semiparametric density model for bulk + tail
{gθ : θ ∈ Θ} a tail-identified family
H := {h(·) a continuous pdf on [0, 1] : 0 < h(0), h(1) < ∞}
F := {f(·) = gθ(·) h(Gθ(·)) : θ ∈ Θ, h ∈ H}
Model: $Y_1, Y_2, \ldots \overset{\text{IID}}{\sim} f$, f ∈ F

12/28
Logistic GP prior on H
Definition (The logistic transform)
L : C([0, 1]) → H given by
$(Lw)(u) = \frac{e^{w(u)}}{\int_0^1 e^{w(t)}\,dt}, \quad u \in [0, 1].$
Definition (The logistic GP)
LGP(µ, σ) = L∗GP(µ, σ)
I.e., h ∼ LGP(µ, σ) ⇐⇒ h = Lw with w ∼ GP(µ, σ)
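
The logistic GP construction can be sketched on a grid as follows. This is a minimal illustration, not the authors' implementation: the squared-exponential parametrization, the grid size, and the helper names `se_cov`, `draw_lgp` and `trapezoid` are all assumptions.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoid rule on a grid (written out to avoid NumPy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def se_cov(u, kappa2, lam):
    """Squared-exponential covariance; this parametrization is an assumption."""
    d = u[:, None] - u[None, :]
    return kappa2 * np.exp(-0.5 * (d / lam) ** 2)

def draw_lgp(u, kappa2, lam, rng):
    """Draw w ~ GP(0, kappa2 * C_SE) on the grid u and return (w, h = Lw)."""
    K = se_cov(u, kappa2, lam)
    # An eigen-decomposition is used instead of Cholesky because SE covariance
    # matrices on fine grids are numerically close to singular.
    vals, vecs = np.linalg.eigh(K)
    vals = np.clip(vals, 0.0, None)
    w = vecs @ (np.sqrt(vals) * rng.standard_normal(u.size))
    ew = np.exp(w)
    h = ew / trapezoid(ew, u)   # logistic transform: normalize exp(w) over [0, 1]
    return w, h

rng = np.random.default_rng(0)
u = np.linspace(0.0, 1.0, 201)
w, h = draw_lgp(u, kappa2=1.0, lam=0.3, rng=rng)
print(f"integral of h = {trapezoid(h, u):.6f}, h(0) = {h[0]:.3f}, h(1) = {h[-1]:.3f}")
```

Because exp(w) is strictly positive and continuous, every draw has 0 < h(0), h(1) < ∞, which is what membership in H requires.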

13/28
sLGP heavy-tailed density estimation on R
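Model: $Y_1, \ldots, Y_n \sim f(\cdot) = g_\theta(\cdot)\, h(G_\theta(\cdot))$
$h \sim \mathrm{LGP}(0, \kappa^2 C^{\mathrm{SE}}_\lambda), \quad (\kappa^2, \lambda^2) \sim \mathrm{Ga}^{-1} \times \mathrm{Ga}^{-1}$
$g_\theta = t_\nu(\mu, \tau^2), \quad \theta = (\mu, \tau^2, \nu) \sim \frac{1}{\tau^2} \times \pi_\nu(\nu)$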
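
To see how a density in this model is assembled from its two pieces, here is a small sketch. It uses gθ = t3(0, 1) and, in place of an LGP draw, a fixed hypothetical h(u) = 1 + 0.5 cos(2πu); both are assumptions chosen so the result is easy to verify.

```python
import numpy as np
from scipy import integrate, stats

g = stats.t(df=3)   # g_theta = t_3(0, 1), an assumed example

def h(u):
    """Hypothetical pdf on [0, 1] with h(0) = h(1) = 1.5 (stand-in for an LGP draw)."""
    return 1.0 + 0.5 * np.cos(2.0 * np.pi * u)

def f(y):
    """Model density f(y) = g(y) * h(G(y))."""
    return g.pdf(y) * h(g.cdf(y))

# f is a proper density on the real line ...
total, _ = integrate.quad(f, -np.inf, np.inf)

# ... and far out in the right tail it is g scaled by h(1).
tail_ratio = float(f(1e3) / g.pdf(1e3))

print(f"integral of f = {total:.6f}, f/g at y = 1000: {tail_ratio:.6f}")
```

The bulk of f is reshaped by h, while the tails keep the decay rate of gθ and differ from it only by the constants h(0) and h(1).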

14/28
gθ = t3(0, 1), λ = 0.3 (top) and 0.08 (bottom)
[Figure: two rows of three panels; each row shows a GP draw w(u) on [0, 1], the resulting density h(u) on [0, 1], and the induced density f(y) for y from −4 to 4]

15/28
Model fitting
Low-rank approximation to GP w
Discretization of length-scale λ over a dense grid
Precomputed covariance matrix + Cholesky factors
Adaptive Metropolis MCMC
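
The slide gives no implementation details, so the following is only a sketch of what precomputing covariance matrices and Cholesky factors over a length-scale grid might look like. The knot set, the λ grid, the jitter value and the squared-exponential form are all hypothetical choices, and a knot-based representation is just one common way to obtain a low-rank GP approximation.

```python
import numpy as np

def se_cov(knots, lam):
    """Squared-exponential correlation between knots (assumed parametrization)."""
    d = knots[:, None] - knots[None, :]
    return np.exp(-0.5 * (d / lam) ** 2)

knots = np.linspace(0.0, 1.0, 11)        # hypothetical low-rank knot set
lam_grid = np.geomspace(0.05, 0.5, 20)   # hypothetical grid of length-scales
jitter = 1e-6                            # small nugget for numerical stability

# Precompute one Cholesky factor per grid value of lambda, so that an MCMC
# update of lambda only needs a table lookup rather than a new factorization.
chol_factors = []
for lam in lam_grid:
    K = se_cov(knots, lam) + jitter * np.eye(knots.size)
    chol_factors.append(np.linalg.cholesky(K))

print(f"precomputed {len(chol_factors)} factors of size {chol_factors[0].shape}")
```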

18/28
Simulation
Simulation setup
Mixture of a standard normal and a centered t4.
100 data sets, n = 2000 each
[Figure: "Simulation 2", density of the mixture of a centered normal and a t with 4 df, for x from −10 to 10]
Check
1. Parameters of gθ, specifically the tail index $\nu^{-1}$
2. Estimation of high (and low) quantiles

19/28
Results
LGP Tail Index Estimation
Parameter   True   Mean estimate   Coverage of 95% CI
ν           4      3.76            88%
ξ           0.25   0.27            88%
Comparison to Generalized Pareto Distribution (GPD)
Fit GPD to absolute values exceeding Q0.975; about 100 exceedances expected out of n = 2000
Fit by maximum likelihood
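
The GPD comparison can be sketched with SciPy as below. This is an illustration under stated assumptions rather than the authors' code: the mixture weights (taken as equal here) are not given on the slide, the seed is arbitrary, and `gpd_quantile_abs` is a hypothetical helper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 2000

# Mixture of a standard normal and a centered t4; equal weights are an assumption.
from_t = rng.random(n) < 0.5
y = np.where(from_t, rng.standard_t(df=4, size=n), rng.standard_normal(n))

# For symmetric data, |Y| > Q_0.975 has probability 0.05, so the empirical
# 0.95 quantile of |Y| serves as the threshold.
abs_y = np.abs(y)
p_exceed = 0.05
threshold = np.quantile(abs_y, 1.0 - p_exceed)
excess = abs_y[abs_y > threshold] - threshold

# Maximum-likelihood GPD fit to the excesses, with the location fixed at 0.
xi_hat, _, sigma_hat = stats.genpareto.fit(excess, floc=0)

def gpd_quantile_abs(p):
    """Estimated p-th quantile of |Y|, valid for p above 1 - p_exceed."""
    cond_p = 1.0 - (1.0 - p) / p_exceed
    return threshold + stats.genpareto.ppf(cond_p, xi_hat, loc=0, scale=sigma_hat)

print(f"exceedances = {excess.size}, xi_hat = {xi_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"estimated 0.999 quantile of |Y| = {gpd_quantile_abs(0.999):.3f}")
```

With only about 100 exceedances, the shape estimate varies noticeably from one data set to the next, which is the variance that the RMSE comparison picks up.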

20/28
Extreme quantiles
Upper-tail Bias and Ratio of RMSE
[Figure: two panels plotted against p = 0.99, 0.999, 0.9999, 0.99999: quantile bias for the GPD and LGP estimators, and quantile RMSE ratio (GPD/LGP)]
GPD has lower quantile RMSE just beyond threshold (p=0.98, 0.99)
Otherwise, LGP has lower RMSE in extrapolated tails

21/28
Transition from parametric to non-parametric
$F^{-1}(p) \approx G_\theta^{-1}\!\left(\frac{p}{h(0)}\right)$, and $F^{-1}(1 - p) \approx G_\theta^{-1}\!\left(1 - \frac{p}{h(1)}\right)$, for p ≈ 0
[Figure: Dataset 1, lower quantiles (p from 0.001 to 0.1) and upper quantiles (p from 0.9 to 0.999); each panel shows the estimated quantile, LGP CI, parametric CI, and true quantile]
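
The upper-quantile approximation on this slide can be checked numerically. The sketch below assumes gθ = t3(0, 1) and a hypothetical h(u) = 1 + 0.5 cos(2πu), for which h(1) = 1.5 and the mass of h near 1 has a closed form; neither choice comes from the slides.

```python
import numpy as np
from scipy import optimize, stats

g = stats.t(df=3)   # assumed g_theta
h1 = 1.5            # h(1) for the hypothetical h(u) = 1 + 0.5 * cos(2 * pi * u)

def upper_mass(v):
    """Integral of h over (1 - v, 1), in closed form for the cosine h."""
    return v + 0.5 * np.sin(2.0 * np.pi * v) / (2.0 * np.pi)

def exact_upper_quantile(p):
    """F^{-1}(1 - p) = G^{-1}(1 - v), where v solves upper_mass(v) = p."""
    v = optimize.brentq(lambda t: upper_mass(t) - p, 0.0, 1.0, xtol=1e-15)
    return g.isf(v)   # isf(v) = G^{-1}(1 - v), accurate for tiny v

def approx_upper_quantile(p):
    """Parametric approximation G^{-1}(1 - p / h(1))."""
    return g.isf(p / h1)

ps = [1e-2, 1e-3, 1e-4]
rel_errs = [abs(approx_upper_quantile(p) / exact_upper_quantile(p) - 1.0) for p in ps]
for p, e in zip(ps, rel_errs):
    print(f"p = {p:g}: relative error of the approximation = {e:.2e}")
```

The relative error shrinks as p decreases, which matches the picture of a nonparametric bulk handing over to a parametric tail.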

22/28
Conclusions and ongoing work
Summary
The sLGP approach provides a promising alternative to POT
Despite some bias, sLGP reduces variance sufficiently in quantile
estimation to give lower RMSE than GPD
Ongoing work
Replace gθ with a family that has different indices in each tail
Obtain theoretical results for estimation of θ
Extend sLGP to multivariate, time-series, spatio-temporal,
regression, and other settings
Application to wind speed vs direction analysis

24/28
Asymmetric t: approach 1

25/28
Asymmetric t: approach 2
Could use $g_{\alpha,\beta}(y) = g(y)\, h_{\alpha,\beta}(G(y))$ with
1. g the $t_{\nu_0}$ pdf, with CDF G
2. $h_{\alpha,\beta}$ the Be(α, β) pdf
Tail index:
Left: $(\alpha \nu_0)^{-1}$
Right: $(\beta \nu_0)^{-1}$
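
The stated tail indices can be checked numerically. In the sketch below the values ν0 = 3, α = 1.5, β = 2 are hypothetical, and the helper names are introduced only for illustration. It uses the fact that if Y has density g(y) h(G(y)), then G(Y) ~ Be(α, β), so tail probabilities of Y are beta CDF values; the log-log slope of each tail estimates the polynomial decay rate, whose reciprocal is the tail index.

```python
import numpy as np
from scipy import stats

nu0, alpha, beta_ = 3.0, 1.5, 2.0   # hypothetical values for illustration
g = stats.t(df=nu0)

def log_right_tail(y):
    """log P(Y > y) = log P(U > G(y)) with U ~ Be(alpha, beta).
    Uses 1 - U ~ Be(beta, alpha) and 1 - G(y) = g.sf(y) to avoid cancellation."""
    return np.log(stats.beta.cdf(g.sf(y), beta_, alpha))

def log_left_tail(y):
    """log P(Y < -y) = log P(U < G(-y)) with U ~ Be(alpha, beta)."""
    return np.log(stats.beta.cdf(g.cdf(-y), alpha, beta_))

def decay_rate(log_tail, y1=1e2, y2=1e3):
    """Negative log-log slope of a tail probability between y1 and y2."""
    return -(log_tail(y2) - log_tail(y1)) / (np.log(y2) - np.log(y1))

right_rate = float(decay_rate(log_right_tail))   # should be close to beta * nu0
left_rate = float(decay_rate(log_left_tail))     # should be close to alpha * nu0

print(f"right decay rate = {right_rate:.4f}, tail index = {1.0 / right_rate:.4f}")
print(f"left decay rate  = {left_rate:.4f}, tail index = {1.0 / left_rate:.4f}")
```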

26/28
Theory for sLGP
Theorem
If the true f* matches the model, f*(y) = gθ*(y) h*(Gθ*(y)), with
1. log h* ∈ C^α([0, 1]) and
2. {gθ : θ ∈ Θ} is regular at θ* (e.g., Cramér conditions)
then
$\operatorname*{plim}_{n \to \infty} \Pi\left( \|f - f^*\|_1 \ge M_n\, n^{-\alpha/(2\alpha+1)} (\log n)^q \,\middle|\, Y_1, \ldots, Y_n \right) = 0$
for any M_n → ∞, with q = (4α + 1)/(4α + 2)
What about estimation of θ?
Is uncertainty suppressed for extreme quantiles?

27/28
Extension to dependent data
The primary idea is to use a copula, which offers a conceptually
simple and sound extension and keeps model fitting relatively
simple.
There are concerns about the appropriate choice of copula,
particularly regarding the tail dependence models that are
derived from it

28/28
Wind analysis
Use a polar transformation to represent the data as wind direction and
wind speed, and model the latter as possibly heavy-tailed data
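
As a minimal sketch of the polar transformation, assuming the raw data are horizontal wind components (called u and v here, which is an assumption about the data layout) and using the mathematical angle convention rather than any particular meteorological one:

```python
import numpy as np

def to_polar(u, v):
    """Map wind components (u, v) to (speed, direction).

    Direction is the mathematical angle in radians in (-pi, pi], measured
    counter-clockwise from the positive u-axis; this convention is an assumption.
    """
    speed = np.hypot(u, v)
    direction = np.arctan2(v, u)
    return speed, direction

# Two hypothetical observations: (u, v) = (3, 4) and (0, 1).
speed, direction = to_polar(np.array([3.0, 0.0]), np.array([4.0, 1.0]))
print(speed, direction)
```

The speed component is then the nonnegative, possibly heavy-tailed variable to which a tail model would be applied, with direction as a covariate.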
