MUMS: Bayesian, Fiducial, and Frequentist Conference - Generalized Probabilistic Principal Component Analysis of Correlated Mortality, Mengyang Gu, April 30, 2019
1. Generalized probabilistic principal component analysis of correlated data
Mengyang Gu and Weining Shen
Department of Applied Mathematics and Statistics, Johns Hopkins University
Department of Statistics, University of California, Irvine
SAMSI BFF Conference
2. Outline
1 Introduction
2 Generalized probabilistic principal component analysis (GPPCA)
3 GPPCA with a mean structure
4 Simulated examples
  Correctly specified models
  Misspecified models
5 Real examples
  Humanity computer model with multiple outputs
  Global gridded temperature anomalies
6 Future directions
4. Introduction
NOAA monthly gridded temperature anomalies
Figure 1: NOAA monthly gridded temperature anomalies (°C) in Feb 2017 and Dec 2018, plotted over longitude and latitude.
5. Introduction
Ground deformation by radar interferograms
Figure 2: Five interferometric synthetic aperture radar (InSAR) interferograms spanning the following time periods: 1) 17 Oct 2011 to 04 May 2012; 2) 21 Oct 2011 to 16 May 2012; 3) 20 Oct 2011 to 15 May 2012; 4) 28 Oct 2011 to 11 May 2012; 5) 12 Oct 2011 to 07 May 2012. The black curves show cliffs and other important topographic features at Kīlauea; the large elliptical feature is Kīlauea Caldera. The color indicates the ground deformation rate in m/yr. The figures are from [Gu and Anderson, 2018].
6. Introduction
Emulation of computer models with multiple outputs
Figure 3: Median (truncated at 20 meters at the volcanic center region) and interquartile range of the GaSP emulator of 'maximum flow height over time' for TITAN2D, at 23,040 spatial locations over Montserrat Island, for the new input values $V^* = 10^{6.9984}$, $\varphi^* = 3.3487$, $\delta^*_{\mathrm{bed}} = 10.8790$, and $\delta^*_{\mathrm{int}} = 31.0300$. The figures are from [Gu and Berger, 2016].
8. Introduction
A latent factor model
Let $y(x) = (y_1(x), \ldots, y_k(x))^T$ be a $k$-dimensional real-valued output vector at a $p$-dimensional input vector $x$. Assume $y_j(x)$ has zero mean for now.
Consider the following latent factor model:
$$y(x) = A z(x) + \epsilon, \quad (1)$$
where $\epsilon$ is a vector of independent Gaussian noise terms with variance $\sigma_0^2$. The $k \times d$ factor loading matrix $A = [a_1, \ldots, a_d]$ relates the $k$-dimensional outputs to the $d$-dimensional vector of factor processes $z(x) = (z_1(x), \ldots, z_d(x))^T$, where $d \leq k$. Any two factor processes are assumed independent.
Assume $Z_l = (z_l(x_1), \ldots, z_l(x_n))$ follows a multivariate normal distribution,
$$Z_l^T \sim \mathcal{MN}(0, \Sigma_l), \quad (2)$$
where $\Sigma_l$ can be parameterized by a covariance function such that the $(i,j)$ entry of $\Sigma_l$ is $\sigma_l^2 K_l(x_i, x_j)$, with $K_l(\cdot,\cdot)$ a kernel function, for $l = 1, \ldots, d$ and $1 \leq i, j \leq n$.
This model is often referred to as the semiparametric latent factor model [Seeger et al., 2005, Alvarez et al., 2012], and it is a special case of the linear model of coregionalization (LMC) [Gelfand et al., 2004].
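As a concrete reference point, here is a minimal simulation from model (1) with independent Gaussian-process factors and a Matérn kernel. The deck itself points to the R package FastGaSP; this Python sketch, and the sizes k, d, n and parameter values in it, are illustrative assumptions only.

```python
import numpy as np

def matern_2_5(x, gamma):
    """Matern correlation with roughness parameter 2.5; see Eq. (14) later in the deck."""
    dist = np.abs(x[:, None] - x[None, :])
    s = np.sqrt(5.0) * dist / gamma
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

rng = np.random.default_rng(0)
k, d, n = 8, 2, 100                      # illustrative sizes: outputs, factors, inputs
x = np.arange(1, n + 1, dtype=float)

# Orthonormal loading matrix A (Assumption 1 below: A^T A = I_d)
A, _ = np.linalg.qr(rng.standard_normal((k, d)))

# Independent GP factors z_l ~ N(0, sigma^2 K), here with a shared kernel
sigma2, gamma, sigma2_0 = 1.0, 100.0, 0.25
K = sigma2 * matern_2_5(x, gamma)        # Sigma_l for every l
Z = rng.multivariate_normal(np.zeros(n), K, size=d)          # d x n factor matrix

# Model (1): Y = A Z + noise with variance sigma_0^2
Y = A @ Z + np.sqrt(sigma2_0) * rng.standard_normal((k, n))
```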
10. Introduction
Estimation of the factor loading matrix
Let $Y = [y(x_1), \ldots, y(x_n)]$ be the $k \times n$ matrix of observations and let $Z = [z(x_1), \ldots, z(x_n)]$ be the $d \times n$ latent factor matrix.
It is popular to estimate A by PCA. [Higdon et al., 2008, Paulo et al., 2012] estimate A by the first d columns of $\sqrt{n}\, U_0 D_0^{1/2}$, where $U_0 D_0 U_0^T$ is the eigendecomposition of $YY^T/n$.
[Tipping and Bishop, 1999] study the latent factor model $Y = AZ + \epsilon$ with independent standard normal factors. Assuming each row of Y has zero mean, the maximum marginal likelihood estimator (MMLE) of A is the first d columns of $U_0 (D_0 - \sigma_0^2 I_k)^{1/2} R$, where R is an arbitrary $d \times d$ orthogonal rotation matrix.
Note that model (1) is unchanged if one replaces the pair $(A, z(x))$ by $(AE, E^{-1} z(x))$ for any invertible matrix E. So only the subspace of A, denoted $\mathcal{M}(A)$, can be uniquely determined.
The linear subspaces given by the above PCA for the LMC model and by the MMLE of the (independent) latent factor model are the same.
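Both subspace estimators are a few lines of linear algebra. A sketch continuing the simulation above; the PPCA noise estimate as the mean of the discarded eigenvalues follows Tipping and Bishop:

```python
# PCA: eigendecomposition of Y Y^T / n
evals, evecs = np.linalg.eigh(Y @ Y.T / n)
order = np.argsort(evals)[::-1]
U0, D0 = evecs[:, order[:d]], evals[order[:d]]

A_pca = np.sqrt(n) * U0 * np.sqrt(D0)       # Higdon et al. scaling of the first d columns
sigma2_0_ppca = evals[order[d:]].mean()     # PPCA estimate of the noise variance
A_ppca = U0 * np.sqrt(np.maximum(D0 - sigma2_0_ppca, 0.0))   # Tipping-Bishop MMLE, R = I

# A_pca, A_ppca and U0 all span the same estimated subspace M(A_hat).
```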
11. Introduction
Research goals
What is the maximum marginal likelihood estimator of the factor
loadings (and other parameters) in the latent factor model (1) (where
the factors are dependent)?
What are the predictive distributions of the new data?
Are they computationally feasible?
If we have additional regressors (covariates), can we also combine
them in the model in a coherent way?
13. Generalized probabilistic principal component analysis (GPPCA)
Orthogonal assumption
Since only the linear subspace of the factor loading matrix, $\mathcal{M}(A)$, is identifiable, we assume the columns of A in model (1) are orthonormal:
Assumption 1.
$$A^T A = I_d. \quad (3)$$
Note one may instead assume $A^T A = c I_d$, where c is a positive constant that can potentially depend on k, e.g. $c = k$. But the variance parameters of the factor processes are estimated from the data, so we focus on Assumption 1.
This assumption is also key for some other estimators of the factor loading matrix [Lam et al., 2011, Lam and Yao, 2012].
The MLE of the factor loading matrix A under Assumption 1 (without marginalizing out Z) is $U_0 R$, where $U_0$ is the matrix of the first d ordered eigenvectors of $YY^T/n$ and R is an orthogonal rotation matrix (the same subspace as the PCA). E.g., [Bai and Ng, 2002] and [Bai, 2003] assume $A^T A = k I_d$ and estimate A by $\sqrt{k}\, U_0$ in modeling high-dimensional time series.
14. Generalized probabilistic principal component analysis (GPPCA)
Marginal likelihood
(Known expression of the marginal likelihood.) Denote the vectorization of the output $Y_v = \mathrm{vec}(Y)$ and the $d \times n$ latent factor matrix $Z = (z(x_1), \ldots, z(x_n))$ at the inputs $\{x_1, \ldots, x_n\}$. After marginalizing out Z, $Y_v$ follows a multivariate normal distribution ([Banerjee et al., 2014]):
$$Y_v \mid A, \sigma_0^2, \Sigma_1, \ldots, \Sigma_d \sim \mathcal{MN}\Big(0, \ \sum_{l=1}^{d} \Sigma_l \otimes (a_l a_l^T) + \sigma_0^2 I_{nk}\Big).$$
Lemma 1 (Marginal likelihood).
Under Assumption 1, the marginal distribution of $Y_v$ in model (1) is the multivariate normal distribution
$$Y_v \mid A, \sigma_0^2, \Sigma_1, \ldots, \Sigma_d \sim \mathcal{MN}\bigg(0, \ \sigma_0^2 \Big[I_{nk} - \sum_{l=1}^{d} (\sigma_0^2 \Sigma_l^{-1} + I_n)^{-1} \otimes (a_l a_l^T)\Big]^{-1}\bigg).$$
16. Generalized probabilistic principal component analysis (GPPCA)
Theorem 1 (Maximum marginal likelihood estimator).
For model (1), under Assumption 1, after marginalizing out Z:
1. if $\Sigma_1 = \ldots = \Sigma_d = \Sigma$, the marginal likelihood is maximized at
$$\hat{A} = U R, \quad (4)$$
where U is the $k \times d$ matrix of the first d principal eigenvectors of
$$G = Y (\sigma_0^2 \Sigma^{-1} + I_n)^{-1} Y^T, \quad (5)$$
and R is an arbitrary $d \times d$ orthogonal rotation matrix;
2. if the covariances of the factor processes are different, denoting $G_l = Y (\sigma_0^2 \Sigma_l^{-1} + I_n)^{-1} Y^T$, the maximum marginal likelihood estimator of A is
$$\hat{A} = \mathrm{argmax}_A \sum_{l=1}^{d} a_l^T G_l a_l, \quad \text{s.t.} \ A^T A = I_d. \quad (6)$$
A numerical optimization algorithm that preserves the orthogonality constraint in (6) is introduced in [Wen and Yin, 2013].
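In the shared-covariance case the estimator is a single eigendecomposition. A sketch continuing the simulation, with the true kernel parameters plugged in purely for illustration:

```python
# G = Y (sigma_0^2 Sigma^{-1} + I_n)^{-1} Y^T, using the equivalent and stabler form
# (sigma_0^2 Sigma^{-1} + I_n)^{-1} = (Sigma + sigma_0^2 I_n)^{-1} Sigma
W = np.linalg.solve(K + sigma2_0 * np.eye(n), K)
G = Y @ W @ Y.T

evals_G, evecs_G = np.linalg.eigh(G)
A_gppca = evecs_G[:, np.argsort(evals_G)[::-1][:d]]   # A_hat = U R with R = I_d
```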
18. Generalized probabilistic principal component analysis (GPPCA)
Generalized probabilistic principal component analysis
The estimator in Theorem 1 is called the generalized probabilistic principal component analysis (GPPCA), which directly extends the PPCA of [Tipping and Bishop, 1999] to correlated factors.
For demonstration purposes, let the $(i,j)$ term of $\Sigma_l$ be $\sigma_l^2 K_l(x_i, x_j)$, where $K_l(\cdot,\cdot)$ is a kernel function with parameters $\gamma_l$.
Denote the signal-to-noise ratio (SNR) $\tau_l = \sigma_l^2 / \sigma_0^2$. Let $\tau = (\tau_1, \ldots, \tau_d)$ and $\gamma = (\gamma_1, \ldots, \gamma_d)$. The maximum marginal likelihood estimator of $\sigma_0^2$ becomes a function of $\hat{A}$, $\tau$ and $\gamma$, namely $\hat\sigma_0^2 = \hat{S}^2/(nk)$, where
$$\hat{S}^2 = \mathrm{tr}(Y^T Y) - \sum_{l=1}^{d} \hat{a}_l^T Y (\tau_l^{-1} K_l^{-1} + I_n)^{-1} Y^T \hat{a}_l.$$
Plugging in $\hat{A}$ and $\hat\sigma_0^2$, the marginal likelihood satisfies
$$L(\tau, \gamma \mid Y, \hat{A}, \hat\sigma_0^2) \propto \prod_{l=1}^{d} |\tau_l K_l + I_n|^{-1/2} \, (\hat{S}^2)^{-nk/2}. \quad (7)$$
After obtaining $(\hat\tau, \hat\gamma)$ by maximizing the marginal likelihood, one gets $\hat{A}$, $\hat\sigma_0^2$, and $\hat\sigma_l^2$ for $l = 1, \ldots, d$.
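A minimal implementation of this profiled likelihood for the shared-covariance case, continuing the sketch, optimized over (τ, γ) on the log scale. It reuses matern_2_5, Y and x from the earlier blocks, and plugs the closed-form estimator of Theorem 1 into Ŝ²; the starting values are arbitrary assumptions.

```python
from scipy.optimize import minimize

def neg_log_profile_lik(log_params, Y, x, d):
    """Negative log of the profile marginal likelihood (7), shared-covariance case."""
    tau, gam = np.exp(log_params)                   # SNR tau and range gamma
    n = len(x)
    Kl = matern_2_5(x, gam)
    # (tau^{-1} K^{-1} + I_n)^{-1} = (K + I_n / tau)^{-1} K
    W = np.linalg.solve(Kl + np.eye(n) / tau, Kl)
    G = Y @ W @ Y.T
    evals = np.linalg.eigvalsh(G)[::-1]
    # At the optimal A_hat, sum_l a_l^T G a_l equals the sum of the top-d eigenvalues
    S2 = np.trace(Y @ Y.T) - evals[:d].sum()
    _, logdet = np.linalg.slogdet(tau * Kl + np.eye(n))
    k = Y.shape[0]
    return 0.5 * d * logdet + 0.5 * n * k * np.log(S2)

res = minimize(neg_log_profile_lik, np.log([4.0, 50.0]), args=(Y, x, d),
               method="Nelder-Mead")
tau_hat, gamma_hat = np.exp(res.x)
```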
19. Generalized probabilistic principal component analysis (GPPCA)
Computational complexity
Each evaluation of the likelihood in (7) needs max(O(dn³), O(dkn)) operations in general.
Each evaluation of the objective function for estimating A in Theorem 1 needs max(O(dn³), O(dkn)) in general; solving the eigenproblem when the covariance is shared needs min(O(kn²), O(k²n)).
When the input is one-dimensional and a Matérn kernel is used, computing the likelihood in (7) takes only O(dkn) operations, without any approximation (see e.g. [Whittle, 1954, Hartikainen and Sarkka, 2010]). The R package FastGaSP on CRAN implements this fast algorithm for Gaussian processes with Matérn kernels [Gu, 2019].
Directly solving the eigenproblem still has rate min(O(kn²), O(k²n)), but the iterative algorithm has rate O(dkn).
20. Generalized probabilistic principal component analysis (GPPCA)
Let $\hat\Sigma_l$ be the estimated covariance matrix of the lth factor, whose $(i,j)$ element is $\hat\sigma_l^2 \hat{K}_l(x_i, x_j)$, obtained by plugging in $\hat\sigma_l^2$ and $\hat\gamma_l$.
Theorem 2 (Predictive distribution).
Under Assumption 1, for any $x^*$ one has
$$Y(x^*) \mid Y, \hat{A}, \hat\gamma, \hat\sigma^2, \hat\sigma_0^2 \sim \mathcal{MN}\big(\hat\mu^*(x^*),\, \hat\Sigma^*(x^*)\big),$$
where
$$\hat\mu^*(x^*) = \hat{A}\, \hat{z}(x^*), \quad (8)$$
with $\hat{z}(x^*) = (\hat{z}_1(x^*), \ldots, \hat{z}_d(x^*))^T$, $\hat{z}_l(x^*) = \hat\Sigma_l^T(x^*) (\hat\Sigma_l + \hat\sigma_0^2 I_n)^{-1} Y^T \hat{a}_l$, and $\hat\Sigma_l(x^*) = \hat\sigma_l^2 (\hat{K}_l(x_1, x^*), \ldots, \hat{K}_l(x_n, x^*))^T$ for $l = 1, \ldots, d$; and
$$\hat\Sigma^*(x^*) = \hat{A}\, \hat{D}(x^*)\, \hat{A}^T + \hat\sigma_0^2 (I_k - \hat{A}\hat{A}^T), \quad (9)$$
with $\hat{D}(x^*)$ a diagonal matrix whose lth diagonal term is
$$\hat{D}_l(x^*) = \hat\sigma_l^2 \hat{K}_l(x^*, x^*) + \hat\sigma_0^2 - \hat\Sigma_l^T(x^*) \big(\hat\Sigma_l + \hat\sigma_0^2 I_n\big)^{-1} \hat\Sigma_l(x^*).$$
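A sketch of the predictive mean (8), continuing the shared-covariance simulation with the true parameters plugged in; x_star and matern_cross are illustrative names introduced here, not part of the deck:

```python
def matern_cross(x, x_star, gamma):
    """Cross-correlations K(x_i, x_star_j) for the Matern-2.5 kernel."""
    dist = np.abs(x[:, None] - x_star[None, :])
    s = np.sqrt(5.0) * dist / gamma
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

x_star = np.linspace(1.0, float(n), 500)
Sig_star = sigma2 * matern_cross(x, x_star, gamma)    # columns are Sigma_l(x*)

# z_hat_l(x*) = Sigma_l(x*)^T (Sigma_l + sigma_0^2 I_n)^{-1} Y^T a_l, stacked over l
z_hat = Sig_star.T @ np.linalg.solve(K + sigma2_0 * np.eye(n), Y.T @ A_gppca)
mu_star = A_gppca @ z_hat.T                           # Eq. (8): k x n_star predictive mean
```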
21. Generalized probabilistic principal component analysis (GPPCA)
Illustrative example
Example 1.
The data are sampled from the latent factor model (1) with the shared covariance matrix $\Sigma_1 = \Sigma_2 = \Sigma$, where x is equally spaced from 1 to n and the kernel function follows (14) with $\gamma = 100$ and $\sigma^2 = 1$. We choose k = 2, d = 1 and n = 100. Two scenarios are implemented, with $\sigma_0^2 = 0.01$ and $\sigma_0^2 = 1$, respectively. The parameters $(\sigma_0^2, \sigma^2, \gamma)$ are assumed to be unknown and estimated from the data.
22. Generalized probabilistic principal component analysis (GPPCA)
Figure 5: Estimation of the factor loading matrix by the PCA and GPPCA for Example 1, with noise variance $\sigma_0^2 = 0.01$ and $\sigma_0^2 = 1$ in the upper and lower panels, respectively. The circles and dots are the first and second rows of Y in the left panels, and of $\tilde{Y} = YL$ in the middle panels, where $L = U D^{1/2}$ with U the eigenvectors and the diagonal of D the eigenvalues of $(\hat\sigma_0^2 \hat\Sigma^{-1} + I_n)^{-1}$. In the right panels, the black, red and blue lines are the subspace of A, the first eigenvector $U_0$, and the first eigenvector of $Y(\hat\sigma_0^2 \hat\Sigma^{-1} + I_n)^{-1} Y^T$, respectively, with the black triangles being the outputs.
23. Generalized probabilistic principal component analysis (GPPCA)
Estimation of the mean
Figure 6: Estimation of AZ for Example 1, with noise variance $\sigma_0^2 = 0.01$ and $\sigma_0^2 = 1$ in the upper and lower panels, respectively. The first and second rows of Y are graphed as the black curves in the left and right panels, respectively. The red dotted curves and the blue dashed curves are the predictions by the PCA and GPPCA, respectively. The grey region is the 95% posterior credible interval from GPPCA.
25. GPPCA with a mean structure
Latent factor model with covariates
Consider the latent factor model with a mean structure for the k-dimensional output vector at the input x,
$$y(x) = (h(x) B)^T + A z(x) + \epsilon, \quad (10)$$
where $h(x)$ is a $1 \times q$ vector of known mean basis functions of the input x and possibly other covariates, and $B = (\beta_1, \ldots, \beta_k)$ is a $q \times k$ matrix of mean (or trend) parameters. Let H be the $n \times q$ matrix whose ith row is $h(x_i)$, and denote $M = I_n - H (H^T H)^{-1} H^T$. We have the following lemma for the marginal likelihood estimator of the variance.
Lemma 2.
Consider an objective prior $\pi(B) \propto 1$. Under Assumption 1, after marginalizing out B and Z, the maximum likelihood estimator of $\sigma_0^2$ is $\hat\sigma_0^2 = S_M^2/(k(n-q))$, where
$$S_M^2 = \mathrm{tr}(Y M Y^T) - \sum_{l=1}^{d} a_l^T Y M (M + \tau_l^{-1} K_l^{-1})^{-1} M Y^T a_l.$$
Moreover, the marginal density of the data satisfies
$$p(Y \mid A, \tau, \gamma, \hat\sigma_0^2) \propto \bigg[\prod_{l=1}^{d} |\tau_l K_l + I_n|^{-1/2} \, \big|H^T (\tau_l K_l + I_n)^{-1} H\big|^{-1/2}\bigg] (S_M^2)^{-k(n-q)/2}.$$
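A sketch of the Lemma 2 quantities, continuing the earlier simulation with a hypothetical intercept-plus-trend basis h(x) = (1, x). The simulated data above have zero mean, so this is purely illustrative, and the true parameters are plugged in:

```python
H = np.column_stack([np.ones(n), x])               # n x q design matrix, q = 2
M = np.eye(n) - H @ np.linalg.solve(H.T @ H, H.T)  # M = I_n - H (H^T H)^{-1} H^T

tau = sigma2 / sigma2_0                            # SNR tau_l, shared across factors here
Kmat = matern_2_5(x, gamma)                        # correlation matrix K_l
inner = M @ np.linalg.solve(M + np.linalg.inv(Kmat) / tau, M)
S2_M = np.trace(Y @ M @ Y.T) - sum(
    A_gppca[:, l] @ Y @ inner @ Y.T @ A_gppca[:, l] for l in range(d))
sigma2_0_hat = S2_M / (k * (n - H.shape[1]))       # MLE of the noise variance
```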
26. GPPCA with a mean structure
GPPCA with the mean structure
Since there is no closed-form expression for the kernel parameters $(\tau, \gamma)$, one can numerically maximize the marginal likelihood to estimate A and the other parameters:
$$\hat{A} = \mathrm{argmax}_A \sum_{l=1}^{d} a_l^T G_{l,M}\, a_l, \quad \text{s.t.} \ A^T A = I_d, \quad (11)$$
$$(\hat\tau, \hat\gamma) = \mathrm{argmax}_{(\tau,\gamma)}\, p(Y \mid \hat{A}, \tau, \gamma). \quad (12)$$
When $\Sigma_1 = \ldots = \Sigma_d$, a closed-form expression for $\hat{A}$ can be obtained as in Theorem 1. In general, we can use the approach of [Wen and Yin, 2013] to solve the constrained optimization problem in (11). After obtaining $\hat\tau$ and $\hat\sigma_0^2$, we transform them to get $\hat\sigma_l^2 = \hat\tau_l \hat\sigma_0^2$ for $l = 1, \ldots, d$.
27. GPPCA with a mean structure
Theorem 3 (Predictive distribution).
Under Assumption 1, after marginalizing out Z and B with the objective prior $\pi(B) \propto 1$, the predictive distribution of model (10) at any $x^*$ is
$$Y(x^*) \mid Y, \hat{A}, \hat\gamma, \hat\sigma^2, \hat\sigma_0^2 \sim \mathcal{MN}\big(\hat\mu_M^*(x^*),\, \hat\Sigma_M^*(x^*)\big).$$
Here
$$\hat\mu_M^*(x^*) = \big(h(x^*) \hat{B}\big)^T + \hat{A}\, \hat{z}_M(x^*),$$
where $\hat{B} = (H^T H)^{-1} H^T (Y - \hat{A} \hat{Z}_M)^T$, $\hat{Z}_M = (\hat{Z}_{1,M}^T, \ldots, \hat{Z}_{d,M}^T)^T$ with $\hat{Z}_{l,M} = \hat{a}_l^T Y M (\hat\Sigma_l M + \hat\sigma_0^2 I_n)^{-1} \hat\Sigma_l$, and $\hat{z}_M(x^*) = (\hat{z}_{1,M}(x^*), \ldots, \hat{z}_{d,M}(x^*))^T$ with $\hat{z}_{l,M}(x^*) = \hat\Sigma_l^T(x^*) (\hat\Sigma_l M + \hat\sigma_0^2 I_n)^{-1} M Y^T \hat{a}_l$, for $l = 1, \ldots, d$. Moreover,
$$\hat\Sigma_M^*(x^*) = \hat{A}\, \hat{D}_M(x^*)\, \hat{A}^T + \hat\sigma_0^2 \big(1 + h(x^*) (H^T H)^{-1} h^T(x^*)\big) (I_k - \hat{A}\hat{A}^T),$$
where $\hat{D}_M(x^*)$ is a diagonal matrix with the lth diagonal term
$$\hat{D}_{l,M}(x^*) = \hat\sigma_l^2 \hat{K}_l(x^*, x^*) + \hat\sigma_0^2 - \hat\Sigma_l^T(x^*) \tilde\Sigma_l^{-1} \hat\Sigma_l(x^*) + \big(h^T(x^*) - H^T \tilde\Sigma_l^{-1} \hat\Sigma_l(x^*)\big)^T \big(H^T \tilde\Sigma_l^{-1} H\big)^{-1} \big(h^T(x^*) - H^T \tilde\Sigma_l^{-1} \hat\Sigma_l(x^*)\big),$$
with $\tilde\Sigma_l = \hat\Sigma_l + \hat\sigma_0^2 I_n$ for $l = 1, \ldots, d$.
31. Simulated examples: Correctly specified models
Evaluation criteria
(Largest principal angle.) Let $0 \leq \phi_1 \leq \ldots \leq \phi_d \leq \pi/2$ be the principal angles between $\mathcal{M}(A)$ and $\mathcal{M}(\hat{A})$, recursively defined by
$$\phi_i = \arccos \max_{a \in \mathcal{M}(A),\, \hat{a} \in \mathcal{M}(\hat{A})} |a^T \hat{a}| = \arccos(|a_i^T \hat{a}_i|),$$
subject to
$$\|a\| = \|\hat{a}\| = 1, \quad a^T a_i = 0, \quad \hat{a}^T \hat{a}_i = 0, \quad i = 1, \ldots, d-1,$$
where $\|\cdot\|$ denotes the $L_2$ norm. The largest principal angle is $\phi_d$. When the columns of A and $\hat{A}$ are orthogonal bases of $\mathcal{M}(A)$ and $\mathcal{M}(\hat{A})$, $\cos(\phi_d)$ equals the smallest singular value of $A^T \hat{A}$ [Björck and Golub, 1973, Absil et al., 2006].
(Average mean squared error (AvgMSE)) of the output over N experiments:
$$\mathrm{AvgMSE} = \frac{\sum_{l=1}^{N} \sum_{j=1}^{k} \sum_{i=1}^{n} \big(\hat{Y}_{j,i}^{(l)} - \mathrm{E}[Y_{j,i}^{(l)}]\big)^2}{knN}, \quad (13)$$
where $\mathrm{E}[Y_{j,i}^{(l)}]$ is the $(j,i)$ term of the mean of the output matrix in the lth experiment, and $\hat{Y}_{j,i}^{(l)}$ is its estimate.
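The largest principal angle can be computed from the smallest singular value of the product of orthonormal bases, per [Björck and Golub, 1973]. A small helper continuing the earlier sketch:

```python
def largest_principal_angle(A1, A2):
    """Largest principal angle between the column spaces of A1 and A2."""
    Q1, _ = np.linalg.qr(A1)                      # orthonormal bases of the subspaces
    Q2, _ = np.linalg.qr(A2)
    s_min = np.linalg.svd(Q1.T @ Q2, compute_uv=False).min()
    return np.arccos(np.clip(s_min, -1.0, 1.0))   # cos(phi_d) = smallest singular value

print(largest_principal_angle(A, A_pca), largest_principal_angle(A, A_gppca))
```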
32. Simulated examples: Correctly specified models
Approaches
In the GPPCA, we let the covariance function of the lth factor be a product kernel, $\sigma_l^2 K_l(x_a, x_b) = \sigma_l^2 \prod_{m=1}^{p} K_{lm}(x_{am}, x_{bm})$, for demonstration purposes, where $K_{lm}(\cdot,\cdot)$ is the Matérn kernel with roughness parameter 2.5,
$$K_{lm}(x_{am}, x_{bm}) = \Big(1 + \frac{\sqrt{5}\,d}{\gamma_{lm}} + \frac{5 d^2}{3 \gamma_{lm}^2}\Big) \exp\Big(-\frac{\sqrt{5}\,d}{\gamma_{lm}}\Big), \quad (14)$$
with $d = |x_{am} - x_{bm}|$ and unknown range parameters $\gamma_l = (\gamma_{l1}, \ldots, \gamma_{lp})$. The MMLE is used to estimate the factor loading matrix and the parameters, and the predictive mean of the data is used for prediction.
In the PCA, $\hat{A}_{pca} = U_0$, where $U_0$ holds the first d eigenvectors of $YY^T/n$.
In [Lam et al., 2011, Lam and Yao, 2012], A is estimated by the first d eigenvectors of $\sum_{q=1}^{q_0} \hat\Sigma_y(q) \hat\Sigma_y^T(q)$, with $q_0 = 1$ (LY1) and $q_0 = 5$ (LY5), where $\hat\Sigma_y(q)$ is the sample covariance of the output at lag q.
Independent GPs and parallel partial GPs are also included for the last simulated example.
33. Simulated examples: Correctly specified models
Example 2 (Factors with the same covariance matrix).
The data are sampled from model (1) with $\Sigma_1 = \ldots = \Sigma_d = \Sigma$, where $x_i = i$ for $1 \leq i \leq n$, and the kernel function in (14) is used with $\gamma = 100$ and $\sigma^2 = 1$. In each scenario, we simulate the data from 16 different combinations of $\sigma_0^2$, k, d and n. We repeat N = 100 times for each scenario. The parameters $(\sigma_0^2, \sigma^2, \gamma)$ are treated as unknown and estimated from the data.
34. Simulated examples: Correctly specified models
Figure 7: The largest principal angle for Example 2, shown as boxplots for the PCA, GPPCA, LY1 and LY5. The panels correspond to (k = 8, d = 4), (k = 40, d = 4), (k = 16, d = 8) and (k = 80, d = 8), with τ = 100 in the first row and τ = 4 in the second row. n = 200 and n = 400 for the left four boxplots and right four boxplots in the first row, respectively; n = 500 and n = 1000 in the second row.
37. Simulated examples: Correctly specified models
Example 3 (Factors with different covariance matrices).
The data are sampled from model (1) where $x_i = i$ for $1 \leq i \leq n$. The noise variance is $\sigma_0^2 = 0.25$ and the kernel function follows (14) with $\sigma^2 = 1$. The range parameter $\gamma$ of each factor is uniformly sampled from $[10, 10^3]$ in each experiment. In each scenario, we simulate the data from 8 different combinations of k, d and n. We repeat N = 100 times for each scenario. The parameters in the kernels and the noise variance are treated as unknown and estimated from the data.
38. Simulated examples: Correctly specified models
Largest principal angles for Example 3
Figure 9: The largest principal angle between the true subspace and the estimated subspace for the four approaches (PCA, GPPCA, LY1, LY5) in Example 3, with panels for (k = 8, d = 4), (k = 40, d = 4), (k = 16, d = 8) and (k = 80, d = 8). The number of observations of each output variable is n = 200 and n = 400 for the left 4 boxplots and right 4 boxplots in the 2 left panels, respectively, and n = 500 and n = 1000 for the left 4 boxplots and right 4 boxplots in the 2 right panels, respectively.
41. Simulated examples: Misspecified models
Example 4 (Unconstrained factor loadings and misspecified kernel functions).
The data are sampled from model (1) with $\Sigma_1 = \ldots = \Sigma_d = \Sigma$ and $x_i = i$ for $1 \leq i \leq n$. Each entry of the factor loading matrix is uniformly sampled from [0, 1] independently (without the orthogonality constraint in (3)). The exponential kernel and the Gaussian kernel are used to generate the data, with different combinations of $\sigma_0^2$ and n, while in the GPPCA we still use the Matérn kernel in (14) for the estimation. We set k = 20, d = 4, $\gamma = 100$ and $\sigma^2 = 1$ in sampling the data. We repeat N = 100 times for each scenario. All the kernel parameters and the noise variance are treated as unknown and estimated from the data.
42. Simulated examples: Misspecified models
Largest principal angle for Example 4
Figure 10: The largest principal angle between the estimated subspace of the four approaches (PCA, GPPCA, LY1, LY5) and the true subspace for Example 4, with k = 20, d = 4 and τ = 0.25. The number of observations is n = 100, n = 200 and n = 400 for the left 4, middle 4 and right 4 boxplots in both panels, respectively. The data are simulated with the exponential kernel in the left panel and with the Gaussian kernel in the right panel.
44. Simulated examples: Misspecified models
Example 5 (Unconstrained factor loadings and deterministic factors).
The data are sampled from model (1) with each latent factor being a deterministic function,
$$z_l(x_i) = \cos(0.05 \pi \theta_l x_i),$$
where $\theta_l \overset{i.i.d.}{\sim} \mathrm{unif}(0, 1)$ for $l = 1, \ldots, d$, with $x_i = i$ for $1 \leq i \leq n$, $\sigma_0^2 = 0.25$, k = 20 and d = 4. Four scenarios are considered, with the sample sizes n = 100, n = 200, n = 400 and n = 800.
45. Simulated examples: Misspecified models
Largest principal angle for Example 5
Figure 11: The largest principal angle between the estimated subspace of the loading matrix and the true subspace for Example 5 (deterministic factors, k = 20, d = 4 and τ = 4). From left to right, the number of observations is n = 100, n = 200, n = 400 and n = 800 for each group of 4 boxplots (PCA, GPPCA, LY1, LY5), respectively.
46. Simulated examples: Misspecified models
AvgMSE for Example 5

         n = 100      n = 200      n = 400      n = 800
PCA      7.0 × 10⁻²   6.0 × 10⁻²   5.4 × 10⁻²   5.2 × 10⁻²
GPPCA    1.4 × 10⁻²   9.2 × 10⁻³   6.7 × 10⁻³   5.5 × 10⁻³
LY1      9.8 × 10⁻¹   7.6 × 10⁻¹   6.3 × 10⁻²   5.7 × 10⁻²
LY5      9.3 × 10⁻²   7.3 × 10⁻²   6.2 × 10⁻²   5.6 × 10⁻²
Ind GP   2.0 × 10⁻²   1.9 × 10⁻²   1.7 × 10⁻²   1.7 × 10⁻²
PP GP    2.0 × 10⁻²   1.9 × 10⁻²   1.8 × 10⁻²   1.8 × 10⁻²

Table 4: AvgMSE for Example 5.

The Ind GP approach treats each output variable independently, and the mean of the output is estimated by the predictive mean in Gaussian process regression.
The PP GP approach also models each output variable independently by a Gaussian process, whereas the covariance function is shared across the k independent Gaussian processes and estimated based on all the data.
49. Real examples: Humanity computer model with multiple outputs
We first consider a testbed called the 'diplomatic and military operations in a non-warfighting domain' (DIAMOND) simulator, which models the number of casualties during the second to sixth day after an earthquake and volcanic eruption in Giarre and Catania. The input variables are 13-dimensional, including the helicopter cruise speed, the engineer ground speed, and the hospital, shelter and food supply capacities in the two places.
We use the same n = 120 training and $n^* = 120$ testing outputs as [Overstall and Woods, 2016] to compare different approaches. The criteria for out-of-sample prediction are
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{j=1}^{k} \sum_{i=1}^{n^*} \big(\hat{Y}_j^*(x_i^*) - Y_j^*(x_i^*)\big)^2}{k n^*}},$$
$$P_{CI}(95\%) = \frac{1}{k n^*} \sum_{j=1}^{k} \sum_{i=1}^{n^*} 1\{Y_j^*(x_i^*) \in CI_{ij}(95\%)\},$$
$$L_{CI}(95\%) = \frac{1}{k n^*} \sum_{j=1}^{k} \sum_{i=1}^{n^*} \mathrm{length}\{CI_{ij}(95\%)\}.$$
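A generic sketch of these three criteria, assuming the held-out outputs, predictive means, and 95% interval bounds are available as k × n* arrays; the function and argument names are illustrative, not part of the deck:

```python
import numpy as np

def prediction_criteria(Y_test, mean_pred, lower95, upper95):
    """RMSE, empirical 95% coverage P_CI, and average 95% interval length L_CI."""
    rmse = np.sqrt(np.mean((mean_pred - Y_test) ** 2))
    p_ci = np.mean((Y_test >= lower95) & (Y_test <= upper95))
    l_ci = np.mean(upper95 - lower95)
    return rmse, p_ci, l_ci
```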
50. Real examples: Humanity computer model with multiple outputs

Method    Mean function        Kernel           RMSE        PCI (95%)  LCI (95%)
GPPCA     Intercept            Gaussian kernel  3.33 × 10²  0.948      1.52 × 10³
GPPCA     Selected covariates  Gaussian kernel  3.18 × 10²  0.957      1.31 × 10³
GPPCA     Intercept            Matérn kernel    2.82 × 10²  0.962      1.22 × 10³
GPPCA     Selected covariates  Matérn kernel    2.74 × 10²  0.957      1.18 × 10³
Ind GP    Intercept            Gaussian kernel  3.64 × 10²  0.918      1.18 × 10³
Ind GP    Selected covariates  Gaussian kernel  4.04 × 10²  0.918      1.17 × 10³
Ind GP    Intercept            Matérn kernel    3.40 × 10²  0.930      0.984 × 10³
Ind GP    Selected covariates  Matérn kernel    3.31 × 10²  0.927      0.967 × 10³
Multi GP  Intercept            Gaussian kernel  3.63 × 10²  0.975      1.67 × 10³
Multi GP  Selected covariates  Gaussian kernel  3.34 × 10²  0.963      1.54 × 10³
Multi GP  Intercept            Matérn kernel    3.01 × 10²  0.962      1.34 × 10³
Multi GP  Selected covariates  Matérn kernel    3.05 × 10²  0.970      1.50 × 10³

Table 5: The GPPCA and Ind GP with the same mean structure and kernels are given in the first 8 rows. The 9th and 10th rows show the emulation results of the two best models in [Overstall and Woods, 2016] using the Gaussian kernel on the same held-out testing output, whereas the last two rows give the results of the same models with the Matérn kernel in (14). For reference, the RMSE is 1.08 × 10⁵ when using the mean of the training output as the prediction.
51. Real examples: Humanity computer model with multiple outputs
Estimated covariance and prediction
Figure 12: The estimated covariance of the casualties by the GPPCA across the different days after the catastrophe is graphed in the left panel. The held-out testing output and the predictions by the GPPCA and the independent GPs, with mean basis $h(x) = (1, x_{11})$ and the Matérn kernel, for the fifth and sixth days, are graphed in the right panel.
53. Real examples: Global gridded temperature anomalies
NOAA global gridded temperature anomalies
The dataset, from the U.S. National Oceanic and Atmospheric Administration (NOAA), contains the global gridded monthly anomalies of the combined air and marine temperature from Jan 1880 to near present, at a 5° × 5° latitude-longitude spatial resolution. The recorded variance of the measurement error is around 0.1.
We compare different approaches on interpolation. We use the monthly temperature anomalies at 1,639 spatial grid boxes over the past 20 years, and hold out 24,000 randomly sampled measurements on 1,200 spatial grid boxes in 20 months as the test data set.
For the GPPCA, the mean basis function is h(x) = (1, x), where x is an integer from 1 to 240 indexing the month. We also assume the covariance is the same for all factor processes.
We also compare with the PPCA, and with spatial smoothing and temporal smoothing approaches. For the temporal smoothing approach, we also assume h(x) = (1, x).
Random forest regression is included as well, making an independence assumption either across space or across time.
54. Real examples: Global gridded temperature anomalies

Method                     Measurement error  RMSE   PCI (95%)  LCI (95%)
GPPCA, d = 50              estimated          0.392  0.877      1.03
GPPCA, d = 100             estimated          0.330  0.774      0.564
GPPCA, d = 50              fixed              0.392  0.938      1.34
GPPCA, d = 100             fixed              0.335  0.976      1.44
PPCA, d = 50               estimated          0.644  0.674      1.09
PPCA, d = 100              estimated          0.644  0.520      1.40
PPCA, d = 50               fixed              0.641  0.760      1.33
PPCA, d = 100              fixed              0.622  0.801      1.40
Temporal smoothing by GP   estimated          1.02   0.940      2.36
Spatial smoothing by GP    estimated          0.623  0.917      1.95
Temporal regression by RF  estimated          0.497  /          /
Spatial regression by RF   estimated          0.444  /          /

Table 6: Out-of-sample prediction of the temperature anomalies by different approaches. The predictive performance of the GPPCA and PPCA is given in the first four and the following four rows, respectively. The performance of the temporal smoothing and spatial smoothing methods is given in the 9th and 10th rows. The last two rows give the predictive RMSE of regression using the random forest (RF) algorithm.
55. Real examples: Global gridded temperature anomalies
Comparison between the GPPCA and spatial smoothing
Figure 13: The interpolated and observed temperature anomalies in April 2013 (°C). The observed temperature anomalies are graphed in the middle panel; the interpolated anomalies by the GPPCA and by the spatial smoothing method are graphed in the left and right panels, respectively. The numbers of training and test observations are 439 and 1,200, respectively. The out-of-sample RMSEs of the GPPCA and the spatial smoothing method are 0.335 and 0.779, respectively.
56. Real examples: Global gridded temperature anomalies
Estimated intercept and trend by the GPPCA
Figure 14: Estimated intercept and monthly change rate of the temperature anomalies (°C) by the GPPCA, using the monthly temperature anomalies between January 1999 and December 2018.
A spatial orthonormal basis for A could also be used; the GPPCA is more general, as it does not require a distance between the functions.
The GPPCA can be extended to the irregular-missingness case by the EM algorithm, or by an MCMC algorithm if one can specify the full posteriors.
58. Future directions
Future directions
A full Bayesian approach for the factor loading matrix and the parameters (based on the computationally feasible marginal likelihood).
Estimating the number of factors.
Convergence rate of the GPPCA.
Extensions when the observations do not form a matrix.
Optimization algorithms on the Stiefel manifold.
Other orthonormal bases for the factor loading matrix.
Other ways to model the factor processes.
Heteroscedastic noise.
59. Future directions
Reference
Gu, M. and Shen, W. (2018) Generalized probabilistic principal component
analysis (GPPCA) for correlated data. arXiv:1808.10868.
61. Future directions
Related literature: frequentist approaches
The MLE of the factor loading matrix A under Assumption 1 (without marginalizing out Z) is $U_0 R$, where $U_0$ is the matrix of the first d ordered eigenvectors of $YY^T$ and R is an orthogonal rotation matrix (the same subspace as the PCA).
PCA is widely used in factor models, particularly in modeling multiple time series. E.g., Bai and Ng [2002] and Bai [2003] assume $A^T A = k I_d$ and estimate A by $\sqrt{k}\, U_0$ in modeling high-dimensional time series.
PCA is also widely used to estimate the basis in the linear model of coregionalization [Higdon et al., 2008, Paulo et al., 2012].
In [Tipping and Bishop, 1999], the linear subspace given by the PCA is the MMLE of the factor model with independent factors.
[Lam et al., 2011, Lam and Yao, 2012] estimate the factor loading matrix of model (1) by $\hat{A}_{LY}$, the first d eigenvectors of $\sum_{q=1}^{q_0} \hat\Sigma_y(q) \hat\Sigma_y^T(q)$, where $\hat\Sigma_y(q)$ is the $k \times k$ sample covariance at lag q of the output and $q_0$ is a fixed positive integer.
Kernel PCA was introduced in machine learning; it maps the output onto a feature space via kernels [Schölkopf et al., 1998, Mika et al., 1999, Hoffmann, 2007].
62. Future directions
Related literature: Bayesian approaches
[West, 2003] points out the connection between PCA and a class of generalized singular g-priors, and introduces a spike-and-slab prior that induces sparse factors in the latent factor model, assuming the factors are independently distributed.
Another sparsity-inducing prior is introduced by [Bhattacharya and Dunson, 2011] under the independence assumption on the factors, and its asymptotic behavior is also discussed.
[Nakajima and West, 2013, Zhou et al., 2014] introduce methods to directly threshold the time-varying factor loading matrix in Bayesian dynamic linear models.
When modeling spatially correlated data, priors have also been discussed for the spatially varying factor loading matrices in the LMC [Gelfand et al., 2004, Banerjee et al., 2014].
[Higdon et al., 2008, Paulo et al., 2012, Fricker et al., 2013] use the LMC for emulating computer models with multiple outputs, estimate the factor loading matrix, and rely on MCMC algorithms for the inference.
63. References
P.-A. Absil, Alan Edelman, and Plamen Koev. On the largest principal angle between random subspaces. Linear Algebra and its Applications, 414(1):288–294, 2006.
Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
Jushan Bai. Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171, 2003.
Jushan Bai and Serena Ng. Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221, 2002.
Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand. Hierarchical Modeling and Analysis for Spatial Data. CRC Press, 2014.
Anirban Bhattacharya and David B. Dunson. Sparse Bayesian infinite factor models. Biometrika, pages 291–306, 2011.
Åke Björck and Gene H. Golub. Numerical methods for computing angles between linear subspaces. Mathematics of Computation, 27(123):579–594, 1973.
Thomas E. Fricker, Jeremy E. Oakley, and Nathan M. Urban. Multivariate Gaussian process emulators with nonseparable covariance structures. Technometrics, 55(1):47–56, 2013.
Alan E. Gelfand, Alexandra M. Schmidt, Sudipto Banerjee, and C.F. Sirmans. Nonstationary multivariate process modeling through spatially varying coregionalization. Test, 13(2):263–312, 2004.
Mengyang Gu. FastGaSP: Fast and Exact Computation of Gaussian Stochastic Process, 2019. URL https://CRAN.R-project.org/package=FastGaSP. R package version 0.5.1.
Mengyang Gu and Kyle Anderson. Calibration of imperfect mathematical models by multiple sources of data with measurement bias. arXiv preprint arXiv:1810.11664, 2018.
Mengyang Gu and James O. Berger. Parallel partial Gaussian process emulation for computer models with massive output. Annals of Applied Statistics, 10(3):1317–1347, 2016.
Mengyang Gu and Yanxun Xu. Nonseparable Gaussian stochastic process: A unified view and computational strategy. arXiv preprint arXiv:1711.11501, 2017.
Jouni Hartikainen and Simo Sarkka. Kalman filtering and smoothing solutions to temporal Gaussian process regression models. In Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on, pages 379–384. IEEE, 2010.
Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley. Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.
Heiko Hoffmann. Kernel PCA for novelty detection. Pattern Recognition, 40(3):863–874, 2007.
Clifford Lam and Qiwei Yao. Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics, 40(2):694–726, 2012.
Clifford Lam, Qiwei Yao, and Neil Bathia. Estimation of latent factors for high-dimensional time series. Biometrika, 98(4):901–918, 2011.
Sebastian Mika, Bernhard Schölkopf, Alex J. Smola, Klaus-Robert Müller, Matthias Scholz, and Gunnar Rätsch. Kernel PCA and de-noising in feature spaces. In Advances in Neural Information Processing Systems, pages 536–542, 1999.
Jouchi Nakajima and Mike West. Bayesian analysis of latent threshold dynamic models. Journal of Business & Economic Statistics, 31(2):151–164, 2013.
Antony M. Overstall and David C. Woods. Multivariate emulation of computer simulators: model selection and diagnostics with application to a humanitarian relief model. Journal of the Royal Statistical Society: Series C (Applied Statistics), 65(4):483–505, 2016.
Rui Paulo, Gonzalo García-Donato, and Jesús Palomo. Calibration of computer models with multivariate output. Computational Statistics and Data Analysis, 56(12):3959–3974, 2012.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
Matthias Seeger, Yee-Whye Teh, and Michael Jordan. Semiparametric latent factor models. Technical report, 2005.
Michael E. Tipping and Christopher M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397–434, 2013.
Mike West. Bayesian factor regression models in the "large p, small n" paradigm. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, editors, Bayesian Statistics 7, pages 723–732. Oxford University Press, 2003. URL http://ftp.isds.duke.edu/WorkingPapers/02-12.html.
Peter Whittle. On stationary processes in the plane. Biometrika, pages 434–449, 1954.
Xiaocong Zhou, Jouchi Nakajima, and Mike West. Bayesian forecasting and portfolio decisions using dynamic dependent sparse factor models. International Journal of Forecasting, 30(4):963–980, 2014.