CLIM Fall 2017 Course: Statistics for Climate Research, Geostats for Large Data Sets - Brian Reich, Oct 24, 2017

Geostats for large datasets
Brian Reich, NCSU
SAMSI, 10/24/2017

Large spatial datasets
Spatial statistics is a mature field, and software now exists to implement most of the fundamental methods
In the last 10-20 years, however, datasets have gotten bigger than standard methods can handle
This is due to new technology like satellites and other remote sensing devices
Today we’ll discuss several methods to accommodate large spatial datasets

Definitions
Let $s_1, \ldots, s_n$ be the $n$ sample locations
The data are denoted $Y = [Y(s_1), \ldots, Y(s_n)]^T$
For now assume the mean is zero, $E[Y(s)] = 0$ for all $s$
The isotropic covariance function is
$$\text{Cov}[Y(s_i), Y(s_j)] = C(h; \theta)$$
where
$h = \|s_i - s_j\|$ is the distance between $s_i$ and $s_j$
$\theta = (\theta_1, \ldots, \theta_p)$ are the covariance parameters (e.g., nugget)
The $n \times n$ covariance matrix $\Sigma(\theta)$ has $(i, j)$ element $C(\|s_i - s_j\|; \theta)$

Maximum likelihood estimation
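The negative log-likelihood function, up to constants, is
$$L(\theta) = \log|\Sigma(\theta)| + Y^T \Sigma(\theta)^{-1} Y$$
The MLE for $\theta$ minimizes $L(\theta)$
This is a disaster for large $n$!
Storing $\Sigma(\theta)$ is impractical for $n \approx 50{,}000$ (about 20 GB as a dense double-precision matrix)
Computing $|\Sigma(\theta)|$ or $\Sigma(\theta)^{-1}$ is $O(n^3)$ and becomes prohibitive around $n \approx 10{,}000$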
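To make the cost concrete, here is a minimal sketch of the full likelihood evaluation. The exponential-plus-nugget covariance, the parameter values, and the function names (`exp_cov`, `neg_loglik`) are assumptions chosen for illustration; they are not from the slides.

```python
import numpy as np

def exp_cov(coords, sigma2, phi, nugget):
    """Exponential covariance plus nugget (an assumed example of C(h; theta))."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma2 * np.exp(-d / phi) + nugget * np.eye(len(coords))

def neg_loglik(y, Sigma):
    """L(theta) = log|Sigma| + y' Sigma^{-1} y via one O(n^3) Cholesky factorization."""
    chol = np.linalg.cholesky(Sigma)              # Sigma = chol @ chol.T
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    z = np.linalg.solve(chol, y)                  # z'z equals y' Sigma^{-1} y
    return logdet + z @ z

rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))               # n = 200 keeps the demo fast
Sigma = exp_cov(coords, sigma2=1.0, phi=0.2, nugget=0.1)
y = np.linalg.cholesky(Sigma) @ rng.standard_normal(200)
full_nll = neg_loglik(y, Sigma)

# Memory for a dense double-precision Sigma when n = 50,000
storage_gb = 50_000**2 * 8 / 1e9
```

Every evaluation of L(θ) inside an optimizer repeats the factorization, which is why the cubic cost matters so much in practice.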

Overview
By now there are many, many methods for this
Variogram fitting
Approximate likelihood methods
Covariance tapering
Spectral methods
Low-rank approximations
Stochastic partial differential equations approximation
More to come I’m sure

Variogram estimation
The variogram is
$$2\gamma(h; \theta) = E\{[Y(s_i) - Y(s_j)]^2\}$$
The semivariogram relates to the covariance as
$$\gamma(h; \theta) = C(0; \theta) - C(h; \theta)$$
Last week you computed the sample variogram $\hat{\gamma}$ for several distances $h_1, \ldots, h_J$
An estimator of $\theta$ is the minimizer of
$$\sum_{j=1}^{J} [\hat{\gamma}(h_j) - \gamma(h_j; \theta)]^2$$
This is like a method of moments estimator
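A rough sketch of this least-squares fit follows. It assumes an exponential covariance without nugget and uses a brute-force grid search instead of a proper optimizer; the function names, bin edges, and grids are all hypothetical choices.

```python
import numpy as np

def sample_variogram(coords, y, bin_edges):
    """Average 0.5 * (Y(s_i) - Y(s_j))^2 over pairs whose distance falls in each bin."""
    i, j = np.triu_indices(len(y), k=1)
    d = np.linalg.norm(coords[i] - coords[j], axis=1)
    half_sq = 0.5 * (y[i] - y[j]) ** 2
    which = np.digitize(d, bin_edges) - 1
    h, gamma_hat = [], []
    for b in range(len(bin_edges) - 1):
        mask = which == b
        if mask.any():
            h.append(d[mask].mean())
            gamma_hat.append(half_sq[mask].mean())
    return np.array(h), np.array(gamma_hat)

def exp_semivariogram(h, sigma2, phi):
    """gamma(h) = C(0) - C(h) for the assumed C(h) = sigma2 * exp(-h / phi)."""
    return sigma2 * (1.0 - np.exp(-h / phi))

def fit_variogram(h, gamma_hat, sigma2_grid, phi_grid):
    """Minimize sum_j [gamma_hat(h_j) - gamma(h_j; theta)]^2 over a parameter grid."""
    best = (np.inf, None, None)
    for s2 in sigma2_grid:
        for phi in phi_grid:
            sse = np.sum((gamma_hat - exp_semivariogram(h, s2, phi)) ** 2)
            if sse < best[0]:
                best = (sse, s2, phi)
    return best

# Simulate data with sigma2 = 1 and phi = 0.2, then fit
rng = np.random.default_rng(1)
coords = rng.uniform(size=(300, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
y = np.linalg.cholesky(np.exp(-dist / 0.2) + 1e-8 * np.eye(300)) @ rng.standard_normal(300)
h, gamma_hat = sample_variogram(coords, y, np.linspace(0.0, 0.5, 11))
sse, sigma2_hat, phi_hat = fit_variogram(
    h, gamma_hat, np.linspace(0.2, 3.0, 29), np.linspace(0.05, 1.0, 20)
)
```

The fit only touches the J binned values, so once the sample variogram is computed the cost no longer depends on n.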

Pairwise-likelihood approximation
Let $l(Y_1, \ldots, Y_n; \theta)$ be the negative log MVN PDF of $Y_1, \ldots, Y_n$
The full negative log-likelihood is then
$$L(\theta) = l(Y_1, \ldots, Y_n; \theta)$$
This is an $O(n^3)$ computation
The pairwise likelihood approximation is
$$L(\theta) \approx \sum_{i<j} l(Y_i, Y_j; \theta)$$
This is an embarrassingly parallelizable $O(n^2)$ computation
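As an illustration, the pairwise sum can be written in closed form because each term only involves a 2 × 2 covariance matrix. The sketch below assumes a zero-mean process and an exponential covariance; `pairwise_nll` and `cov_fn` are hypothetical names, and constants are dropped as in L(θ).

```python
import numpy as np

def pairwise_nll(coords, y, cov_fn):
    """Sum over pairs i<j of log|S_ij| + (Y_i, Y_j) S_ij^{-1} (Y_i, Y_j)'."""
    i, j = np.triu_indices(len(y), k=1)
    h = np.linalg.norm(coords[i] - coords[j], axis=1)
    c0, ch = cov_fn(0.0), cov_fn(h)
    det = c0**2 - ch**2                            # determinant of each 2x2 block
    quad = (c0 * (y[i] ** 2 + y[j] ** 2) - 2.0 * ch * y[i] * y[j]) / det
    return np.sum(np.log(det) + quad)

def cov_fn(h):
    """Assumed exponential covariance with unit variance and range 0.2."""
    return np.exp(-h / 0.2)

rng = np.random.default_rng(2)
coords = rng.uniform(size=(300, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
y = np.linalg.cholesky(cov_fn(dist) + 1e-8 * np.eye(300)) @ rng.standard_normal(300)
pair_nll = pairwise_nll(coords, y, cov_fn)
```

Each pair is independent of the others, so the sum can be split across workers; in practice one might also restrict it to pairs within some distance.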

Independent-block approximation
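Let $P_1, \ldots, P_J$ be a partition of the spatial domain
Denote $n_j$ as the number of observations in block $P_j$
Let $Y_{j1}, \ldots, Y_{jn_j}$ be the observations in block $P_j$
The independent-block approximation is
$$L(\theta) \approx \sum_{j=1}^{J} l(Y_{j1}, \ldots, Y_{jn_j}; \theta)$$
Say there are $J = \sqrt{n}$ blocks each with $n_j = n/J = \sqrt{n}$ observations
Each block costs $O(n^{3/2})$, so the total is $O(n^2)$ in serial or $O(n^{3/2})$ wall-clock time with one processor per block; the computation is embarrassingly parallelizable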
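A minimal sketch of the block approximation, assuming the domain is the unit square and the partition is a regular grid of cells. The helper names and the exponential-plus-nugget covariance are assumptions made for the example.

```python
import numpy as np

def gaussian_nll(y, Sigma):
    """log|Sigma| + y' Sigma^{-1} y via Cholesky."""
    chol = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(chol, y)
    return 2.0 * np.sum(np.log(np.diag(chol))) + z @ z

def exp_cov(d):
    """Assumed covariance: exponential with range 0.2 plus a 0.01 nugget at distance 0."""
    return np.exp(-d / 0.2) + 0.01 * (d == 0)

def block_nll(coords, y, cov_fn, n_side):
    """Split the unit square into n_side x n_side cells and sum the per-block likelihoods."""
    cell = np.minimum((coords * n_side).astype(int), n_side - 1)
    label = cell[:, 0] * n_side + cell[:, 1]
    total = 0.0
    for b in np.unique(label):
        idx = np.flatnonzero(label == b)
        pts = coords[idx]
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
        total += gaussian_nll(y[idx], cov_fn(d))   # blocks are treated as independent
    return total

rng = np.random.default_rng(3)
n = 625                                            # J = 25 = sqrt(n) blocks below
coords = rng.uniform(size=(n, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
Sigma = exp_cov(dist)
y = np.linalg.cholesky(Sigma) @ rng.standard_normal(n)
approx_nll = block_nll(coords, y, exp_cov, n_side=5)
```

The loop body has no shared state, so each block could be sent to a separate worker.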

Vecchia approximation
If $Y_1, \ldots, Y_n$ are MVN, then the conditional distribution of one observation given the rest is univariate normal
Denote $\phi(Y_i; Y_{(i)}, \theta)$ as the conditional density of one observation $Y_i$ given the vector of observations $Y_{(i)}$
The full log-likelihood can be written
$$\log \phi(Y_1; \theta) + \sum_{i=2}^{n} \log \phi(Y_i; Y_{(i)}, \theta)$$
where $Y_{(i)} = (Y_1, \ldots, Y_{i-1})^T$ and $\phi(Y_1; \theta)$ is the marginal density of $Y_1$
This is not helpful because the final term still has an $(n-1) \times (n-1)$ covariance matrix to be inverted

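Vecchia approximation
The Vecchia approximation trims the conditioning sets
We can approximate the likelihood by letting
$$Y_{(i)} \subseteq \{Y_1, \ldots, Y_{i-1}\}$$
For example, we might condition on only the $m = 15$ points in $s_1, \ldots, s_{i-1}$ that are closest to $s_i$
The full log-likelihood can be approximated by
$$\sum_{i=1}^{n} \log \phi(Y_i; Y_{(i)}, \theta)$$
where $Y_{(i)}$ is empty for $i = 1$ and holds all earlier observations when $i \le m$
This is $O(nm^3)$, which is linear in $n$ for fixed $m$, and can be done in blocks and/or in parallel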
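The following sketch shows one way to code the trimmed conditioning sets. It is written on the same scale as L(θ), i.e., minus twice the log density with constants dropped. The ordering is simply the order of the rows, and `vecchia_nll` and `exp_cov` are hypothetical names with an assumed exponential-plus-nugget covariance.

```python
import numpy as np

def exp_cov(d):
    """Assumed covariance: exponential with range 0.2 plus a 0.01 nugget at distance 0."""
    return np.exp(-d / 0.2) + 0.01 * (d == 0)

def vecchia_nll(coords, y, cov_fn, m):
    """Sum of log(var_i) + (Y_i - mean_i)^2 / var_i, where each conditional moment
    uses at most the m nearest previously ordered observations."""
    c0 = cov_fn(np.zeros(1))[0]
    total = 0.0
    for i in range(len(y)):
        if i == 0:
            mean, var = 0.0, c0                    # marginal distribution of Y_1
        else:
            d_prev = np.linalg.norm(coords[:i] - coords[i], axis=1)
            nb = np.argsort(d_prev)[:m]            # indices of the conditioning set
            pts = coords[nb]
            S_nn = cov_fn(np.linalg.norm(pts[:, None] - pts[None, :], axis=-1))
            s_in = cov_fn(d_prev[nb])
            wts = np.linalg.solve(S_nn, s_in)      # Kriging weights, an m x m solve
            mean, var = wts @ y[nb], c0 - wts @ s_in
        total += np.log(var) + (y[i] - mean) ** 2 / var
    return total

rng = np.random.default_rng(4)
coords = rng.uniform(size=(300, 2))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
Sigma = exp_cov(dist)
y = np.linalg.cholesky(Sigma) @ rng.standard_normal(300)
approx_nll = vecchia_nll(coords, y, exp_cov, m=15)
```

Setting m to n − 1 or larger recovers the full likelihood exactly, which gives a convenient correctness check on small datasets.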

Likelihood approximations
When are likelihood approximations valid?
A likelihood approximation is unbiased if
$$E_{Y|\theta}\left[\frac{\partial L(\theta)}{\partial \theta_j}\right] = 0$$
for all $j$ and $\theta = (\theta_1, \ldots, \theta_p)$
This holds for all of the approximations we’ve discussed
Standard errors can be computed using sandwich
covariance estimators
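For reference, the usual form of the sandwich estimator is sketched below. The symbols $\hat{\theta}$ (the minimizer of the approximate $L$), $H$, and $V$ are notation assumed here, not taken from the slides.

```latex
% H: expected curvature of the approximate criterion; V: variability of its gradient
H(\theta) = E_{Y|\theta}\left[ \frac{\partial^2 L(\theta)}{\partial\theta \, \partial\theta^T} \right],
\qquad
V(\theta) = \mathrm{Var}_{Y|\theta}\left[ \frac{\partial L(\theta)}{\partial\theta} \right],
\qquad
\mathrm{Var}(\hat{\theta}) \approx H(\hat{\theta})^{-1} \, V(\hat{\theta}) \, H(\hat{\theta})^{-1}
```

For the exact likelihood the two matrices carry the same information and the expression reduces to the usual inverse Fisher information; for an approximation they generally differ, which is why the naive standard errors are not reliable.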

Tapering
For most covariance functions, Σ(θ) is dense, i.e., all
entries are non-zero
A sparse matrix is one with many zero entries
Sparse matrix operations can be fast
Even though the covariance is positive at every distance for most common models, it approaches zero for distant pairs of points

Tapering
Tapering sets the covariance to zero for points past a certain distance
Let $C_T(h; \theta)$ be the tapered covariance function with $C_T(h; \theta) = 0$ for $h > h_0$
You can’t simply threshold a non-sparse covariance function; you must be careful to preserve a valid covariance, e.g., by multiplying it by a compactly supported correlation function
Denote the $n \times n$ tapered covariance matrix as $\Sigma_T(\theta)$
The approximate negative log-likelihood is
$$\log|\Sigma_T(\theta)| + Y^T \Sigma_T(\theta)^{-1} Y$$
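A small sketch of tapering by elementwise multiplication. The Wendland taper, the exponential covariance, and the value of h0 are assumptions chosen for the example; the matrices are kept dense here for simplicity, whereas a real implementation would store the tapered matrix in a sparse format and use a sparse Cholesky.

```python
import numpy as np

def wendland_taper(h, h0):
    """Wendland-type correlation (1 - r)^4 (4r + 1): a valid correlation function
    in up to 3 dimensions that is exactly zero for h >= h0."""
    r = np.minimum(h / h0, 1.0)
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

def gaussian_nll(y, Sigma):
    """log|Sigma| + y' Sigma^{-1} y via Cholesky."""
    chol = np.linalg.cholesky(Sigma)
    z = np.linalg.solve(chol, y)
    return 2.0 * np.sum(np.log(np.diag(chol))) + z @ z

rng = np.random.default_rng(5)
coords = rng.uniform(size=(400, 2))
h = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
Sigma = np.exp(-h / 0.3)                           # dense: every entry is non-zero
Sigma_T = Sigma * wendland_taper(h, h0=0.15)       # elementwise product keeps validity

sparsity = np.mean(Sigma_T == 0.0)                 # fraction of exact zeros
min_eig = np.linalg.eigvalsh(Sigma_T).min()        # positive for a valid covariance

y = np.linalg.cholesky(Sigma) @ rng.standard_normal(400)
tapered_nll = gaussian_nll(y, Sigma_T)
```

The elementwise product of two valid covariance matrices is again a valid covariance matrix, which is what a hard threshold does not guarantee.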

Low-rank approximations
Say we can decompose the spatial process into a smooth component $f(s)$ and iid errors $e(s)$,
$$Y(s) = f(s) + e(s)$$
Then we can approximate $f$ using a linear combination of $L < n$ basis functions $B_1(s), \ldots, B_L(s)$,
$$f(s) = \sum_{l=1}^{L} B_l(s) b_l$$
where $b_1, \ldots, b_L$ are unknown coefficients
We are free to pick any basis we want, e.g., $B_l(s)$ could be a polynomial function of $s$
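To see why a low-rank structure helps, here is a sketch under two added assumptions that the slide does not state: the coefficients are iid Normal(0, 1) and the errors have variance `tau2`. Then Cov(Y) = B Bᵀ + tau2·I, and the Woodbury identity and matrix determinant lemma reduce the likelihood to L × L linear algebra. The quadratic polynomial basis and all names are illustrative.

```python
import numpy as np

def lowrank_nll(y, B, tau2):
    """log|B B' + tau2 I| + y' (B B' + tau2 I)^{-1} y using only L x L factorizations."""
    n, L = B.shape
    M = np.eye(L) + B.T @ B / tau2                 # L x L matrix, costs O(n L^2)
    chol = np.linalg.cholesky(M)
    # Determinant lemma: |tau2 I + B B'| = tau2^n |I + B'B / tau2|
    logdet = n * np.log(tau2) + 2.0 * np.sum(np.log(np.diag(chol)))
    # Woodbury: y' Sigma^{-1} y = y'y / tau2 - v' M^{-1} v with v = B'y / tau2
    u = np.linalg.solve(chol, B.T @ y / tau2)
    quad = y @ y / tau2 - u @ u
    return logdet + quad

rng = np.random.default_rng(6)
s = rng.uniform(size=(500, 2))
# Quadratic polynomial basis in s = (s1, s2): L = 6 columns
B = np.column_stack([np.ones(500), s[:, 0], s[:, 1],
                     s[:, 0] ** 2, s[:, 0] * s[:, 1], s[:, 1] ** 2])
tau2 = 0.25
y = B @ rng.standard_normal(6) + np.sqrt(tau2) * rng.standard_normal(500)
fast_nll = lowrank_nll(y, B, tau2)
```

The n × n covariance matrix is never formed, so both memory and time scale linearly in n for fixed L.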

Examples
EOFs: $B_l$ are eigenvectors of the sample covariance
Spectral: $B_l$ are trig functions of $s$
Splines: $B_l$ are spline functions
Fixed rank Kriging
Kernel convolutions
Predictive process
Multiresolution approximation
More I’m sure

Kernel convolution
Any stationary Gaussian process $f$ can be written as
$$f(s) = \int B(s - v; \theta) b(v) \, dv$$
where $B$ is a kernel function and $b$ is a white noise process
The induced covariance is
$$\text{Cov}[f(s), f(s')] = \int B(s - v; \theta) B(s' - v; \theta) \, dv$$
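Example: $B(s - v; \theta) = \theta_1 \exp(-\theta_2 \|s - v\|^2)$ gives
$$\text{Cov}[f(s), f(s')] \propto \theta_1^2 \exp(-\theta_2 \|s - s'\|^2 / 2)$$
with a proportionality constant that depends only on $\theta_2$ and the dimension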

Kernel convolution
The low-rank approximation is obvious
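$$f(s) = \int B(s - v; \theta) b(v) \, dv \approx \sum_{l=1}^{L} B(s - v_l; \theta) b_l$$
The knots $v_1, \ldots, v_L$ cover the spatial domain
The coefficients $b_l \overset{iid}{\sim} \text{Normal}(0, 1)$
$$\text{Cov}[f(s), f(s')] = \sum_{l=1}^{L} B(s - v_l; \theta) B(s' - v_l; \theta)$$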
Estimation: MLE or Bayes
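A brief sketch of the discretized kernel convolution with the Gaussian kernel from the example. The knot grid, the parameter values, and the name `kernel_basis` are assumptions for illustration.

```python
import numpy as np

def kernel_basis(coords, knots, theta1, theta2):
    """B[i, l] = theta1 * exp(-theta2 * ||s_i - v_l||^2)."""
    d2 = np.sum((coords[:, None, :] - knots[None, :, :]) ** 2, axis=-1)
    return theta1 * np.exp(-theta2 * d2)

g = np.linspace(0.0, 1.0, 8)
knots = np.array([(a, b) for a in g for b in g])   # L = 64 knots covering the unit square
rng = np.random.default_rng(7)
coords = rng.uniform(size=(500, 2))

B = kernel_basis(coords, knots, theta1=1.0, theta2=30.0)
b = rng.standard_normal(len(knots))                # b_l iid Normal(0, 1)
f = B @ b                                          # one draw of f at the 500 sites
cov_f = B @ B.T                                    # implied covariance, rank at most L
```

Since the covariance is B Bᵀ, adding iid errors gives exactly the structure that Woodbury-type identities exploit.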

Predictive process model
Spatial process at the data points: $f_s = [f(s_1), \ldots, f(s_n)]^T$
Spatial process at the knots: $f_v = [f(v_1), \ldots, f(v_L)]^T$
If we knew $f_v$, the Kriging prediction for $f_s$ would be
$$f_s = \Sigma_{sv}(\theta) \Sigma_{vv}(\theta)^{-1} f_v$$
Can be written as a low-rank process
$$f(s) = \sum_{l=1}^{L} B_l(s; \theta) b_l$$
where $B_l$ are complicated functions of the covariance matrices and $b_l = f(v_l)$ with $f_v \sim \text{Normal}[0, \Sigma_{vv}(\theta)]$
Model is exact if the knots are $s_1, \ldots, s_n$
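The construction can be sketched as follows, assuming an exponential covariance with a fixed range; `cov_fn` and `predictive_process_cov` are hypothetical names.

```python
import numpy as np

def cov_fn(a, b, phi=0.3):
    """Assumed exponential covariance between two sets of locations."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-d / phi)

def predictive_process_cov(sites, knots):
    """Return the implied covariance Sigma_sv Sigma_vv^{-1} Sigma_vs and the basis matrix."""
    S_sv = cov_fn(sites, knots)
    S_vv = cov_fn(knots, knots)
    Bmat = np.linalg.solve(S_vv, S_sv.T).T         # row i holds B_l(s_i; theta), l = 1..L
    return Bmat @ S_vv @ Bmat.T, Bmat

rng = np.random.default_rng(8)
sites = rng.uniform(size=(300, 2))
g = np.linspace(0.1, 0.9, 5)
knots = np.array([(a, b) for a in g for b in g])   # L = 25 knots

cov_pp, Bmat = predictive_process_cov(sites, knots)

# With the knots placed at the data locations the model reproduces the full covariance
cov_exact, _ = predictive_process_cov(sites[:60], sites[:60])
```

The rank of `cov_pp` is at most L, so the low-rank likelihood machinery applies directly.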

SPDE
The stochastic partial differential equation (SPDE)
approach combines many of these ideas
It’s a bit complicated, so let’s stick with the case of a Matérn correlation with smoothness $\nu = 1$ and variance $\sigma^2 = 1$
It turns out that a Matérn Gaussian process with $\nu = 1$ is the solution to the SPDE
$$\frac{1}{\phi^2} Y(s) - \frac{\partial^2 Y(s)}{\partial s_1^2} - \frac{\partial^2 Y(s)}{\partial s_2^2} = \frac{4\pi}{\phi^2} Z(s),$$
where $\phi$ is the range and $Z(s)$ is a white noise process

Finite approximation
Say the observations are on a grid $s \in \{\ldots, -1, 0, 1, \ldots\}^2$
Denote $Y(s_1, s_2)$ as the value in row $s_1$ and column $s_2$
The approximate second derivative (analogous for $s_2$) is
$$\frac{\partial^2 Y(s)}{\partial s_1^2} \approx Y(s_1 + 1, s_2) - 2Y(s_1, s_2) + Y(s_1 - 1, s_2)$$

Finite approximation
Inserting the finite approximation into the SPDE gives
$$\left(1 + \frac{1}{4\phi^2}\right) Y(s_1, s_2) - \bar{Y}(s_1, s_2) = w Z(s_1, s_2),$$
where
$\bar{Y}(s_1, s_2)$ is the mean of $(s_1, s_2)$’s four neighbors
$w = \pi / \phi^2$
$Z(s_1, s_2)$ are iid standard normal
φ = ∞ gives a random walk and φ = 0 gives white noise

SPDE
In matrix notation, the system of equations is
$$B(\theta) Y = w Z$$
where
$B(\theta) = \left(1 + \frac{1}{4\phi^2}\right) I_n - \frac{1}{4} A$
$A$ is the adjacency matrix with $(i, j)$ element equal to 1 if sites $i$ and $j$ are adjacent, and zero otherwise
$Z \sim \text{Normal}(0, I_n)$

SPDE
Solving for $Y$ gives
$$Y = w Q(\theta) Z$$
where $Q(\theta) = B(\theta)^{-1}$
Therefore $Y \sim \text{Normal}[0, w^2 Q(\theta) Q(\theta)^T]$
$B(\theta)$ is sparse and so is the inverse covariance, $B(\theta)^T B(\theta) / w^2$
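A sketch of the grid system on a small finite lattice. Two simplifications are assumed: boundary cells have fewer than four neighbors, so the term A/4 is only the neighbor mean in the interior, and the matrices are stored densely for readability where real code would use a sparse format. The grid size, the value of φ, and the function name are illustrative.

```python
import numpy as np

def grid_adjacency(n_rows, n_cols):
    """A[i, j] = 1 if grid cells i and j share an edge, else 0."""
    n = n_rows * n_cols
    A = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if r + 1 < n_rows:                     # neighbor in the next row
                A[i, i + n_cols] = A[i + n_cols, i] = 1.0
            if c + 1 < n_cols:                     # neighbor in the next column
                A[i, i + 1] = A[i + 1, i] = 1.0
    return A

phi = 2.0
A = grid_adjacency(15, 15)
n = A.shape[0]
w = np.pi / phi**2
B = (1.0 + 1.0 / (4.0 * phi**2)) * np.eye(n) - A / 4.0

precision = B.T @ B / w**2                         # inverse covariance of Y
frac_nonzero = np.mean(precision != 0.0)           # only nearby cells interact

Z = np.random.default_rng(9).standard_normal(n)
Y = w * np.linalg.solve(B, Z)                      # one draw, from solving B Y = w Z
```

Each row of the precision matrix has at most 13 non-zero entries regardless of n, which is what makes sparse Cholesky methods effective here.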

What if data are not on a grid
Assume the knots $v_1, \ldots, v_L$ are on a grid
The SPDE model is applied to the process at the knot locations
$$f(v) \sim \text{Normal}[0, w^2 Q(\theta) Q(\theta)^T]$$
The response at site $s$ is then
$$Y(s) = \sum_{l=1}^{L} B_l(s - v_l) f(v_l) + e(s)$$
The authors use local linear functions for $B_l$

Final comments
Extensions that are easy
Adding covariates
Simple spatiotemporal models
Simple multivariate models
Extensions that are hard
Non-Gaussian data
Non-stationarity
Bayesian
Hierarchical models

Group discussion
Everybody read the first section of
https://arxiv.org/pdf/1710.05013.pdf
Read the assigned subsection of Section 2
Discuss your subsection with your group
Meet with another group and explain your assigned
subsection to the other group
Repeat
Skim Sections 3 and 4
Discuss with your group
