Self-made project: Predictive Modelling and
Multivariate Analysis using climate data
October 1, 2016
Harris Phan
1 Introduction/Aim
The aim is to find the significant factors that contribute to ice-free days in
Kotzebue, Alaska. The dataset can be found on this website:
http://www.multivariatestatistics.org/data.html
Many factors could contribute to icy weather in any country. To build a
predictive statistical model, one must consider data from the past: the data
that has been recorded will be used to predict how many icy days there will be
in the future. However, datasets sometimes contain factors that are similar to
each other and may therefore be redundant, hence the need for multivariate
analysis techniques such as principal components analysis and factor analysis.
The first part of this report consists of multivariate analysis techniques: the
multivariate test of normality, Hotelling's $T^2$, multiple and partial
correlation, principal components analysis and factor analysis. The second part
consists of predictive modelling techniques such as time series regression (and
possibly neural networks). SAS is used throughout this report.
2 Data and description
Data         Description
Year         The year in which the data was recorded, from 1981 to 2003.
AO           Arctic Oscillation. AOsumm is the AO in the summer and AOwint is
             the AO in the winter. More information:
             https://www.ncdc.noaa.gov/teleconnections/ao/
NPI          North Pacific Index, which is the area-weighted sea level
             pressure. NPIspring is the NPI in spring and NPIwinter is the NPI
             in winter. More information:
             https://climatedataguide.ucar.edu/climate-data/north-pacific-np-index-trenberth-and-hurrell-monthly-and-winter
Temp         Temperature in degrees Celsius. TempSummer is the temperature in
             summer and TempWinter is the temperature in winter.
Rain         Amount of rainfall. RainSumm is how much rain fell in summer and
             RainWint is how much rain fell in winter.
Ice          How much ice there was over the whole year. IceJanJul is how much
             ice there was on average between January and July, and IceOctDec
             is how much ice there was on average between October and December.
IceFreeDays  The number of days that were ice-free in Kotzebue, Alaska.
3 Multivariate analysis
3.1 Tests of multivariate normality
To check whether three or more variables can reasonably be treated as
multivariate normal, two test statistics are used here: Mardia's multivariate
skewness and kurtosis measures. The hypotheses are:
\[ H_0 : (X_1, X_2, X_3, X_4, X_5)^T \text{ is multivariate normal} \]
\[ H_1 : (X_1, X_2, X_3, X_4, X_5)^T \text{ is not multivariate normal.} \]
Now we use Mardia's multivariate skewness ($\kappa_1$) and kurtosis ($\kappa_2$)
measures, which are given by:
\[ \kappa_1 = \frac{n\hat{\beta}_{1,p}}{6} \sim \chi^2_{p(p+1)(p+2)/6} \]
\[ \kappa_2 = \frac{\hat{\beta}_{2,p} - p(p+2)}{[8p(p+2)/n]^{1/2}} \sim N(0, 1). \]
Since there is no default procedure, I will put down my code here:
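proc iml;
/* Assumption: the five year-round variables are read from the climatedata
   data set; the data set and variable names follow the rest of this report */
use climatedata;
read all var {AO NPI Temp Rain Ice} into x;
close climatedata;
r = nrow(x);                      /* number of observations, n */
c = ncol(x);                      /* number of variables, p */
dfc = c*(c+1)*(c+2)/6;            /* degrees of freedom for the skewness test */
q = i(r) - (1/r)*j(r,r,1);        /* centring matrix */
s = (1/(r))*x`*q*x ; s_inv = inv(s) ;   /* covariance with divisor n */
g_matrix = q*x*s_inv*x`*q ;
beta1hat = ( sum(g_matrix#g_matrix#g_matrix) )/(r*r) ;   /* Mardia skewness */
beta2hat = trace( g_matrix#g_matrix )/r ;                /* Mardia kurtosis */
kappa1 = r*beta1hat/6 ;
kappa2 = (beta2hat - c*(c+2) ) /sqrt(8*c*(c+2)/r) ;
pvalskew = 1 - probchi(kappa1,dfc) ;
pvalkurt = 2*( 1 - probnorm(abs(kappa2)) ) ;
print s ;
print s_inv ;
print beta1hat ;
print kappa1 ;
print pvalskew ;
print beta2hat ;
print kappa2 ;
print pvalkurt ;
quit;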
Let X1 = AO, X2 = NPI, X3 = Temp, X4 = Rain and X5 = Ice. From the SAS output,
$\kappa_1$ = 23.527265, with a p-value of 0.930064. At the 0.05 significance
level there is therefore not enough evidence to reject the null hypothesis on
the basis of skewness. For kurtosis, the output gives 29.913077 with a p-value
of 0.1448567. (With p = 5 and n = 23, the value 29.913077 is consistent with
$\hat{\beta}_{2,p}$, which standardises to $\kappa_2 \approx -1.458$.) Since
0.1448567 is greater than 0.05, there is again not enough evidence to reject
the null hypothesis. The multivariate normality assumption is therefore in
reasonable agreement with the data. Note that this only considers the
year-round factors; season has not been accounted for. We can therefore
proceed with Hotelling's $T^2$.
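The two test statistics can be checked by hand from the figures quoted in this subsection. The sketch below assumes p = 5 variables and n = 23 yearly observations (1981 to 2003), and it reads the quoted 29.913077 as $\hat{\beta}_{2,p}$ rather than as the standardised $\kappa_2$. That reading is an inference from the arithmetic, since it reproduces the quoted p-value; it is not stated in the SAS output.

```latex
\begin{align*}
\frac{p(p+1)(p+2)}{6} &= \frac{5 \cdot 6 \cdot 7}{6} = 35
  \quad \text{(degrees of freedom for } \kappa_1\text{)} \\
\kappa_2 &= \frac{29.913077 - 5 \cdot 7}{\sqrt{8 \cdot 5 \cdot 7 / 23}}
  = \frac{-5.086923}{\sqrt{12.1739}} \approx -1.458 \\
2\,[1 - \Phi(1.458)] &\approx 0.145
\end{align*}
```

Both p-values (0.930064 and roughly 0.145) are above 0.05, so neither the skewness nor the kurtosis test rejects multivariate normality.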
3.2 Hotelling's $T^2$
We compare the means of rainfall and temperature in the summer with the means
of rainfall and temperature in the winter. The primary reason for using
Hotelling's $T^2$ is to see whether the seasons have an effect on both the
temperature and the rainfall. The hypothesis to test is:
\[ H_0 : E\begin{pmatrix} X_1 - X_2 \\ X_3 - X_4 \end{pmatrix}
       = E\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = 0. \]
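This hypothesis can be formulated by using the matrix:
\[ Y = \begin{pmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{pmatrix} X. \]
Note that this matrix, as written, produces the differences $X_1 - X_3$ and
$X_2 - X_4$. The differences $X_1 - X_2$ and $X_3 - X_4$ in the hypothesis
correspond to the rows $(1, -1, 0, 0)$ and $(0, 0, 1, -1)$, so the ordering of
the variables must be chosen to match the matrix.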
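The test statistic implied by this transformation can be written out explicitly. As a sketch, write $C$ for the $2 \times 4$ matrix above and $q$ for its number of rows; both symbols are introduced here for illustration and do not appear in the original. The transformed observations $Y = CX$ then have sample mean $C\bar{X}$ and sample covariance $CSC^T$, and the one-sample statistic is applied to $Y$:

```latex
\begin{align*}
\bar{Y} &= C\bar{X}, \qquad S_Y = C\,S\,C^{T} \\
T^2 &= n\,\bar{Y}^{T} S_Y^{-1}\,\bar{Y}
  \;\sim\; \frac{(n-1)\,q}{n-q}\,F_{q,\,n-q}
  \quad \text{under } H_0, \text{ with } q = 2
\end{align*}
```

The same matrix has to be used for both the mean and the covariance. Otherwise $\bar{Y}$ and $S_Y$ describe different differences.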
The SAS code is provided here:
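proc corr cov noprob nocorr outp = outcovmatrix;
var x1-x4;
run;
proc iml;
/* Read the covariance matrix and the mean vector written by proc corr */
use outcovmatrix where(_TYPE_="COV");
read all var _NUM_ into cov[colname=varNames];
use outcovmatrix where(_TYPE_="MEAN");
read all var _NUM_ into meansMatrix[colname=varNames];
print cov, meansMatrix;
/* Partitioned blocks, kept for reference (not used in the test below) */
Sigma_11 = cov[ {3 4}, {3 4}];
Sigma_22 = cov[ {1 2}, {1 2}];
Sigma_12 = cov[ {3 4}, {1 2}];
Sigma_21 = cov[ {1 2}, {3 4}];
mu_1 = meansMatrix[{3 4}];
mu_2 = meansMatrix[{1 2}];
SInv = inv(cov);
print SInv;
/* Contrast matrix and covariance of the transformed vector y*x */
y = {1 0 -1 0, 0 1 0 -1};
covyx = y*cov*y`;
print covyx;
covyxinv = inv(covyx);
print covyxinv;
/* Mean differences. NOTE: these pair variables (1,2) and (3,4), whereas y
   pairs (1,3) and (2,4); both should use the same pairing */
mudiff_1 = meansMatrix[{1}]-meansMatrix[{2}];
mudiff_2 = meansMatrix[{3}]-meansMatrix[{4}];
print mudiff_1;
print mudiff_2;
n = 23;   /* number of years, 1981-2003 */
p = 4;    /* NOTE: the transformed vector has 2 components; p = 4 is kept
             because it produced the critical value reported in the text */
/* Mean differences entered by hand, presumably from the values printed above */
mu_vect = {27.396087,3.2408696};
T2 = n * mu_vect` * covyxinv * mu_vect;
print T2;
critical_value_of_F = p*(n-1)/(n-p)*finv(.95, p, n-p);
print critical_value_of_F;
quit;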
Note that the formula for Hotelling's $T^2$ is given by:
\[ T^2 = n(\bar{X} - \mu_0)^{T} S^{-1} (\bar{X} - \mu_0)
       \sim \frac{(n-1)p}{n-p} F_{p,\,n-p}. \]
From the SAS output, Hotelling's $T^2$ is 5631.1594, which is greater than the
critical value of 13.408918. There is therefore enough evidence to reject the
null hypothesis: the mean rainfall and temperature in the summer differ from
the mean rainfall and temperature in the winter.
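The quoted critical value can be reproduced by hand from the $F$ scaling factor. The sketch below uses n = 23 and p = 4 as set in the SAS code. For comparison it also shows the value for dimension 2, which is the number of components of the transformed vector $Y$. The $F$ quantiles are approximate table values, not SAS output.

```latex
\begin{align*}
p = 4:\quad \frac{(n-1)p}{n-p}\,F_{0.95;\,4,\,19}
  &= \frac{22 \cdot 4}{19} \times 2.895 \approx 4.632 \times 2.895 \approx 13.41 \\
p = 2:\quad \frac{(n-1)p}{n-p}\,F_{0.95;\,2,\,21}
  &= \frac{22 \cdot 2}{21} \times 3.467 \approx 2.095 \times 3.467 \approx 7.26
\end{align*}
```

The observed statistic of 5631.1594 is far above either threshold, so the decision to reject the null hypothesis does not depend on which dimension is used.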
A similar comparison is made between the means of AO and rain in the summer
and the means of AO and rain in the winter. The Hotelling's $T^2$ value is
5945.4702, which is also greater than the critical value of 13.408918.
Overall, therefore, summer conditions differ greatly from winter conditions.
3.3 Simple, Partial and Multiple correlation
We now wish to find out the strength of the relationships between variables. To
do this, correlation procedures are used. If the variables are highly correlated
with each other, they are potential candidates for principal components
analysis and factor analysis, which will be done later. In this subsection,
variables of the same type are tested against each other: the AO-like
variables, the NPI-like variables, the Temperature-like variables, the
Rain-like variables and the Ice-like variables. Then we test whether there is a
strong correlation between days with no ice and the other variables. First the
AO variables are tested. Simple correlation in SAS gives us,
proc corr data=climatedata;
var AO AO_wint AO_summ;
run;
          AO        AOwint    AOsumm
AO        1.00000   0.58909   0.56242
AOwint    0.58909   1.00000   0.93234
AOsumm    0.56242   0.93234   1.00000
So there is a strong correlation (0.93234) between the Arctic Oscillation in
winter and the Arctic Oscillation in summer. However, the overall Arctic
Oscillation is only moderately correlated with the winter and summer Arctic
Oscillation (0.58909 and 0.56242 respectively). Now consider the NPI-like
variables,
proc corr data=climatedata;
var NPI NPI_spring NPI_winter;
run;
            NPI       NPIspring   NPIwinter
NPI         1.00000   0.18645     0.99790
NPIspring   0.18645   1.00000     0.19444
NPIwinter   0.99790   0.19444     1.00000
In this case, only the NPI in winter has a strong correlation with the overall
NPI (0.99790). The other pairs are weakly correlated with each other. However,
out of curiosity, I decided to also try partial correlation.
proc corr data=climatedata;
var NPI NPI_spring;
partial NPI_winter;
run;
The correlation between NPI and NPIspring, with NPIwinter taken as the partial
variable, is -0.11919, which makes it negative but still weak. However, making
NPIwinter the variable and NPIspring the partial variable gives a correlation
coefficient close to 1, which means that even after taking the spring NPI into
account, the winter NPI remains strongly correlated with the overall NPI.
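The first partial correlation quoted here can be checked directly from the simple correlations in the NPI table. As a sketch, label NPI, NPIspring and NPIwinter as variables 1, 2 and 3 (labels introduced here for illustration). The standard first-order partial correlation formula then gives:

```latex
\begin{align*}
r_{12 \cdot 3}
  &= \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \\
  &= \frac{0.18645 - 0.99790 \times 0.19444}
          {\sqrt{(1 - 0.99790^2)(1 - 0.19444^2)}}
   \approx \frac{-0.00758}{0.06354} \approx -0.119
\end{align*}
```

This agrees with the SAS value of -0.11919 up to rounding. Because $r_{13}$ is so close to 1, the denominator is small and the result is sensitive to the fifth decimal place of the inputs.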
3.4 Principal components analysis
We now carry out principal components analysis (PCA). This is a variable
reduction procedure, and in the climate dataset there appear to be variables
that are correlated with each other. A large sample size is advised when using
PCA, because correlations need a large sample before they stabilise. The SAS
princomp procedure performs principal components analysis. Its output has three
parts: the simple statistics and the correlation matrix; the eigenvalues and
eigenvectors of the correlation matrix; and the scree plot/variance explained.
The eigenvalues are simply the variances of the principal components
themselves.
proc princomp data= climatedata;
var Ice Ice_JanJul Ice_OctDec;
run;
What is important here is the scree plot, as it graphs the eigenvalue against
the component number. For the three candidate variables Ice, IceJanJul and
IceOctDec, only one of the eigenvalues is above 1, so by the minimum-eigenvalue
(mineigen) criterion it is the only component that should be retained.
Components with an eigenvalue of less than 1 are of little use. So according to
principal components analysis, seasonality does not matter much, and only Ice
is retained.
proc princomp data= climatedata;
var AO AO_wint AO_summ;
run;
Even though AOsumm and AOwint have a correlation above 0.9, I still include
both in the principal components analysis. As with the Ice-like variables, only
one eigenvalue is above 1 for the AO-like variables. So again, according to
PCA, seasonality does not matter and we retain AO as the variable.
Using a similar princomp procedure, we see that only one NPI-like variable
should be used.
proc princomp data= climatedata;
var Temp Temp_summ Temp_wint;
run;
However, the Temperature-like variables are trickier to handle. We now get
eigenvalues of 1.12191146, 1.02992761 and 0.84816093. It is clear from the
correlation matrix provided by princomp that the temperature in summer is only
weakly correlated with the yearly temperature, and the same goes for the winter
temperature. Seasonal variance should therefore be accounted for. All three
eigenvalues are close to 1, so by the eyeball test none of the
Temperature-like variables should be discarded.
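The eigenvalues quoted for the Temperature-like variables can be turned into proportions of variance explained. Because princomp works on the correlation matrix, the eigenvalues sum to the number of variables (here 3), so each proportion is $\lambda_i / 3$. A short worked version:

```latex
\begin{align*}
\lambda_1 + \lambda_2 + \lambda_3 &= 1.12191146 + 1.02992761 + 0.84816093 = 3 \\
\frac{\lambda_1}{3} &\approx 0.374, \qquad
\frac{\lambda_2}{3} \approx 0.343, \qquad
\frac{\lambda_3}{3} \approx 0.283 \\
\frac{\lambda_1 + \lambda_2}{3} &\approx 0.717
\end{align*}
```

No single component dominates: the first two together explain only about 72% of the variance, and dropping the third would discard roughly 28%. This supports keeping all three Temperature-like variables.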
3.5 Factor Analysis