STAT7001
Computing for Practical Statistics
In-Course Assessment 2

Contents
TASK 1: PREDICTION OF THE ISTANBUL STOCK MARKET
TASK 2: THE RESISTANCE OF CONSTANTIN
APPENDIX A: TASK 1 R CODE
APPENDIX B: TASK 2 SAS CODE
APPENDIX C: REFERENCES
Task 1: Prediction of the Istanbul Stock Market
Main Question
The main task was to use different prediction strategies to predict the daily returns of the Istanbul Stock
Exchange (ISE) index, based on data of past ISE returns as well as the returns of seven other stock indices, and
to compare the performance of these prediction methods using error measures such as RMSE, MAE,
and their relative variants.
Throughout this report, a 5% significance level is applied to all analyses.
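The four error measures used throughout the report can be stated compactly. The following is an illustrative pure-Python sketch of the same quantities computed by the R helper functions in Appendix A (the analysis itself was done in R; the helper names here are hypothetical):

```python
import math

def rmse(obs, fit):
    # root mean squared error of predictions
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(obs, fit)) / len(obs))

def mae(obs, fit):
    # mean absolute error of predictions
    return sum(abs(o - f) for o, f in zip(obs, fit)) / len(obs)

def rel_rmse(obs, fit):
    # relative variant: each residual is divided by the observed value
    return math.sqrt(sum(((o - f) / o) ** 2 for o, f in zip(obs, fit)) / len(obs))

def rel_mae(obs, fit):
    return sum(abs((o - f) / o) for o, f in zip(obs, fit)) / len(obs)
```

The relative variants are sensitive to observed values near zero, which is why their confidence intervals in the tables below are much wider than those of RMSE and MAE.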
Summary
For benchmarking experiments where ISE returns were predicted based on data from the same day, the
models based on other stock indices were significantly better than taking the mean ISE return as a predictor;
while the inclusion of time did not result in any significant changes in the goodness of prediction.
For benchmarking experiments where predictions were made only based on previous data, the reverse was
observed as predictors based only on prior ISE returns performed significantly better than models based on
previous stock index returns, suggesting a non-linear relationship may exist between ISE returns and that of
previous days.
Exploratory Data Analysis
Figures 1 to 8. Scatter plots of stock index returns (y-axis) against number of days since earliest record (x-axis),
with respective correlation estimates and p-values.
From the scatter plots in Figures 1 to 8, there is no apparent association between the returns of the stock
indices and time, as the location of the index returns does not appear to change over time. A correlation test
was performed on each stock index's returns against time; the results indicate no apparent linear association
between the variables at the 5% significance level, as all p-values were greater than 0.05.
Figure 1 (ISE): cor = -0.0499, p-value = 0.2485
Figure 2 (S&P 500): cor = 0.0245, p-value = 0.5714
Figure 3 (DAX): cor = 0.0299, p-value = 0.4891
Figure 4 (FTSE 100): cor = 0.0190, p-value = 0.6615
Figure 5 (Nikkei 225): cor = 0.00533, p-value = 0.9019
Figure 6 (Ibovespa): cor = -0.0582, p-value = 0.1786
Figure 7 (MSCI EU): cor = 0.0121, p-value = 0.7803
Figure 8 (MSCI EM): cor = -0.0538, p-value = 0.2136
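Each of the tests above is a Pearson test of zero correlation. As an illustrative pure-Python sketch (the report uses R's cor.test; the helper names here are hypothetical), the correlation coefficient and the t statistic that generates the p-value are:

```python
import math

def pearson_r(x, y):
    # sample Pearson correlation coefficient
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cor_t_stat(r, n):
    # t statistic for H0: rho = 0; the p-value comes from comparing this
    # against a t distribution with n - 2 degrees of freedom
    return r * math.sqrt((n - 2) / (1 - r ** 2))
```

With 536 observations, even the largest correlation above (|cor| = 0.0582) gives a t statistic well inside the acceptance region, consistent with all p-values exceeding 0.05.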
Figures 9 to 16. Scatter plots of stock index returns for day N-1, N-2 and N-3 (y-axis) respectively (left to right)
against stock index returns for day N (x-axis), with respective correlation estimates and p-values.
The scatter plots in Figures 9 to 16 show that there is generally no pattern in a stock index's returns
against its returns one, two, and three days before. Correlation tests were carried out to confirm this; they
suggest that there is no correlation for all scatter plots except the first one in Figure 16, which shows
a slight positive association (cor = 0.149, p-value = 0.0005403 < 0.05) between MSCI EM returns and its returns one day
earlier.
In the light of the above, it may be reasonable to suggest that, in general, stock index returns are not linearly
related to those of the recent past few days.
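The lagged scatter plots are built by aligning each day-N return with the return from k trading days earlier. A minimal sketch of that alignment (the R code in Appendix A does the same via row-index offsets; the helper name here is hypothetical):

```python
def lagged_pairs(returns, k):
    # pair each day-N return with the return from k days earlier;
    # the first k days have no lagged partner and are dropped
    return [(returns[i], returns[i - k]) for i in range(k, len(returns))]
```

Each list of pairs is then fed to the correlation test; with 536 trading days, lag k leaves 536 - k usable pairs.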
Figure 9 (ISE): lag 1: cor = 0.0188, p-value = 0.6651; lag 2: cor = -0.0124, p-value = 0.7745; lag 3: cor = -0.0337, p-value = 0.4369
Figure 10 (S&P 500): lag 1: cor = -0.0608, p-value = 0.1605; lag 2: cor = -0.0300, p-value = 0.4888; lag 3: cor = -0.00845, p-value = 0.8456
Figure 11 (DAX): lag 1: cor = 0.00132, p-value = 0.9758; lag 2: cor = -0.0262, p-value = 0.5453; lag 3: cor = -0.0172, p-value = 0.6919
Figure 12 (FTSE 100): lag 1: cor = -0.00739, p-value = 0.8646; lag 2: cor = -0.0276, p-value = 0.5248; lag 3: cor = -0.0218, p-value = 0.6149
Figure 13 (Nikkei 225): lag 1: cor = -0.0782, p-value = 0.07085; lag 2: cor = 0.0261, p-value = 0.5479; lag 3: cor = 0.000953, p-value = 0.9825
Figure 14 (Ibovespa): lag 1: cor = -0.0485, p-value = 0.2626; lag 2: cor = -0.0140, p-value = 0.7463; lag 3: cor = -0.0457, p-value = 0.2921
Figure 15 (MSCI EU): lag 1: cor = 0.00995, p-value = 0.8184; lag 2: cor = -0.0420, p-value = 0.3323; lag 3: cor = 0.00254, p-value = 0.9533
Figure 16 (MSCI EM): lag 1: cor = 0.149, p-value = 0.0005403; lag 2: cor = -0.0141, p-value = 0.7449; lag 3: cor = 0.0489, p-value = 0.2599
Results and Interpretation of Prediction from Same Day Indices (Part C)
Table 1. Error measures and their 95% confidence intervals for the different prediction methods, under the
respective validation set-ups.

| Validation Set-up         | Prediction Method        | RMSE                    | MAE                        | Relative RMSE      | Relative MAE      |
|---------------------------|--------------------------|-------------------------|----------------------------|--------------------|-------------------|
| Chronological 80-20 Split | Mean                     | 0.0131 (0.0112, 0.0149) | 0.0100 (0.0084, 0.0116)    | 1.68 (1.13, 2.23)  | 1.22 (1.00, 1.44) |
|                           | Linear Model w/o Time    | 0.0108 (0.0094, 0.0122) | 0.00855 (0.00729, 0.00980) | 3.35 (1.63, 5.07)  | 1.61 (1.05, 2.17) |
|                           | Linear Model w/ Time     | 0.0107 (0.0093, 0.0121) | 0.00852 (0.00729, 0.00975) | 3.06 (1.63, 4.49)  | 1.56 (1.06, 2.06) |
|                           | Robust Linear Regression | 0.0105 (0.0091, 0.0119) | 0.00845 (0.00726, 0.00964) | 3.04 (1.33, 4.74)  | 1.53 (1.03, 2.03) |
| 5-Fold Cross-Validation   | Mean                     | 0.0162 (0.0134, 0.0191) | 0.0121 (0.0100, 0.0141)    | 1.49 (1.03, 1.95)  | 1.14 (0.96, 1.32) |
|                           | Linear Model w/o Time    | 0.0120 (0.0100, 0.0140) | 0.00920 (0.00773, 0.01067) | 7.43 (1.30, 13.55) | 2.11 (0.76, 3.46) |
|                           | Linear Model w/ Time     | 0.0120 (0.0100, 0.0140) | 0.00920 (0.00773, 0.01066) | 7.46 (1.30, 13.62) | 2.11 (0.75, 3.47) |
|                           | Robust Linear Regression | 0.0121 (0.0101, 0.0142) | 0.00928 (0.00780, 0.01076) | 6.85 (1.36, 12.34) | 2.02 (0.78, 3.26) |

| Validation Set-up         | Compared Against      | Linear Model w/o Time      | Linear Model w/ Time       | Robust Linear Regression   |
|---------------------------|-----------------------|----------------------------|----------------------------|----------------------------|
| Chronological 80-20 Split | Mean                  | V = 3695, p = 0.0123       | V = 3688, p = 0.01307      | V = 3612, p = 0.02473      |
|                           | Linear Model w/o Time |                            | V = 3031, p = 0.6601       | V = 3135, p = 0.4455       |
|                           | Linear Model w/ Time  |                            |                            | V = 3163, p = 0.3953       |
| 5-Fold Cross-Validation   | Mean                  | V = 96316, p = 1.121e-11   | V = 96500, p = 7.849e-12   | V = 96397, p = 9.587e-12   |
|                           | Linear Model w/o Time |                            | V = 72898, p = 0.7934      | V = 68503, p = 0.3356      |
|                           | Linear Model w/ Time  |                            |                            | V = 68297, p = 0.3075      |

Table 2. Results of paired Wilcoxon signed-rank tests on the absolute residuals of the different prediction
methods.

The similar error measures for the linear models with and without time, as shown in Table 1, indicate that
both models perform about as well as each other, under both the chronological and 5-fold cross-validation
set-ups. This is confirmed by the paired Wilcoxon signed-rank test results in Table 2, with p-values of
0.6601 and 0.7934 (for the chronological and 5-fold set-ups respectively) indicating no significant
difference in the absolute residuals of the two models.

These results agree with the preliminary conclusions drawn from the exploratory analysis. Since no apparent
association was found between stock index returns and time, the linear models with and without time should
perform equally well in predicting ISE returns, as the addition of the time variable provides no significant
information.

Additionally, a robust linear regression (RLR) model was fitted, predicting the ISE return from the
same-day returns of the other stock indices, without time. This also produced error measures similar to those
of the ordinary least squares regression models, and the paired Wilcoxon signed-rank tests likewise indicated
no significant difference in absolute residuals (p-values of 0.4455 and 0.3953 for the chronological set-up;
0.3356 and 0.3075 for 5-fold).

However, these three models all have RMSE and MAE values considerably smaller than those of the
prediction using the mean ISE return of the training data set, as seen in Table 1. This is confirmed by the paired
Wilcoxon signed-rank tests. For the chronological set-up, the p-values of 0.0123, 0.01307 and 0.02473 (for
the tests of mean vs. LM w/o time, LM w/ time and RLR respectively) suggest some evidence of a difference in
the absolute residuals of the models. For the 5-fold set-up, the p-values of 1.121e-11, 7.849e-12 and
9.587e-12 respectively suggest strong evidence of a difference in the absolute residuals of the models.
This allows us to conclude at the 5% significance level that the mean ISE return is a worse prediction method than
any of the other three models, under both validation set-ups.
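The V statistic reported throughout is the sum of the ranks of the positive paired differences of absolute residuals. The following is an illustrative pure-Python sketch using the large-sample normal approximation; note that R's wilcox.test additionally uses an exact distribution for small samples and a continuity correction, so its p-values will differ slightly:

```python
import math

def wilcoxon_signed_rank(a, b):
    # paired test on the differences a_i - b_i;
    # V = sum of the ranks of the positive differences
    d = [x - y for x, y in zip(a, b) if x != y]   # zero differences are dropped
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                   # average ranks over ties
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    v = sum(r for r, x in zip(ranks, d) if x > 0)
    # two-sided p-value from the large-sample normal approximation
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = math.erfc(abs(v - mu) / sigma / math.sqrt(2)) if sigma > 0 else 1.0
    return v, p
```

Applied to the pairs of absolute residuals from two prediction methods, a small p-value indicates that one method's residuals are systematically larger than the other's.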
Results and Interpretation of Prediction from Previous Day Indices (Part D)
| Validation Set-up   | Prediction Method                | RMSE                    | MAE                     | Relative RMSE         | Relative MAE       |
|---------------------|----------------------------------|-------------------------|-------------------------|-----------------------|--------------------|
| 11 Consecutive Days | Most Recent ISE Return           | 0.0221 (0.0203, 0.0240) | 0.0170 (0.0158, 0.0182) | 16.31 (6.83, 25.78)   | 4.23 (2.88, 5.57)  |
|                     | Mean ISE Return of Recent 5 Days | 0.0173 (0.0159, 0.0187) | 0.0130 (0.0120, 0.0140) | 4.50 (3.11, 5.89)     | 1.97 (1.63, 2.32)  |
|                     | LM - Most Recent Day             | 0.724 (0.319, 1.129)    | 0.169 (0.110, 0.230)    | 221.3 (91.4, 351.2)   | 43.1 (24.5, 61.7)  |
|                     | LM - Most Recent 2 Days          | 2.59 (0.31, 4.86)       | 0.274 (0.054, 0.494)    | 280.0 (134.7, 425.3)  | 49.1 (25.5, 72.6)  |
|                     | Robust Linear Regression         | 0.0634 (0.0397, 0.0870) | 0.0344 (0.0298, 0.0389) | 25.89 (17.98, 7.98)   | 8.79 (6.70, 10.87) |

Table 3. Error measures and their 95% confidence intervals for the different prediction methods.
| Validation Set-up   | Compared Against                 | Mean ISE Return of Recent 5 Days | LM - Most Recent Day       | LM - Most Recent 2 Days    | Robust Linear Regression    |
|---------------------|----------------------------------|----------------------------------|----------------------------|----------------------------|-----------------------------|
| 11 Consecutive Days | Most Recent ISE Return           | V = 95496, p = 5.857e-14         | V = 15103, p < 2.2e-16     | V = 12427, p < 2.2e-16     | V = 100020, p < 2.2e-16     |
|                     | Mean ISE Return of Recent 5 Days |                                  | V = 10271, p < 2.2e-16     | V = 7468, p < 2.2e-16      | V = 111260, p < 2.2e-16     |
|                     | LM - Most Recent Day             |                                  |                            | V = 65944, p = 0.3359      | V = 25768, p < 2.2e-16      |
|                     | LM - Most Recent 2 Days          |                                  |                            |                            | V = 31006, p < 2.2e-16      |

Table 4. Results of paired Wilcoxon signed-rank tests on the absolute residuals of the different prediction
methods.
The error measures for the LM based on stock index returns from the most recent day are all smaller than
those for the LM based on the most recent 2 days. However, the large standard errors of these error measures
suggest that this difference might not be significant; this is confirmed by the paired Wilcoxon signed-rank
test, where a p-value of 0.3359 indicates no significant difference in the absolute residuals of the two models.

A robust linear regression model was also fitted, predicting the ISE return from the returns of the stock
indices on the most recent day. These are the same covariates as in the LM for the most recent day; however,
the coefficients are estimated by a different, more robust method, and the robust regression shows lower
values for all measures of prediction goodness. This is confirmed by the paired Wilcoxon signed-rank test,
with p-value < 2.2e-16 indicating a significant difference in the absolute residuals of the two models.
However, the prediction method using the mean ISE return of the recent 5 days shows the lowest value for
every error measure. Furthermore, the upper bound of the 95% CI for all four of its error measures lies below
the lower bound of the corresponding 95% CI for every other model. The p-values from the paired Wilcoxon
signed-rank tests of this method against all other methods (5.857e-14, <2.2e-16, <2.2e-16, <2.2e-16) also
support significant differences in the absolute residuals obtained from the models. Thus, there is strong
evidence that the mean ISE return of the recent 5 days is the best of the five prediction methods used in this
benchmarking experiment.
It should be noted that the initial exploratory analysis concluded that there appears to be no linear
association between ISE returns and its returns one, two, and three days before (Figure 9). However, the
benchmarking experiment in Part D suggests that the method based on the mean ISE return of the recent 5
days is the best prediction method, which contradicts the exploratory results. This may suggest that
non-linear associations exist between ISE returns and its returns on the preceding days, allowing predictions
to be made from previous ISE returns. Alternatively, this result may simply reflect poorly designed
prediction methods, with the mean ISE return of the recent 5 days performing only relatively better than the
rest.
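The winning method is simply a trailing window mean. A minimal sketch (hypothetical helper name; Appendix A implements the equivalent in R):

```python
def trailing_mean_predictions(returns, k):
    # predict the return on day i as the mean of the k preceding days' returns;
    # the first k days cannot be predicted and are skipped
    return [sum(returns[i - k:i]) / k for i in range(k, len(returns))]
```

For the "11 consecutive days" validation set-up, each prediction for day N uses only data available before day N, so no future information leaks into the predictor.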
Conclusion
To assess the prediction methods for the Istanbul stock market, two different benchmarking experiments
were performed in this task: one predicting ISE returns from the other indices on the same day, and the other
predicting them from all indices, including the ISE itself, on recent previous days.
The first benchmarking experiment showed that the prediction methods using least squares regression on
same-day data from the other indices generally performed better than using the mean ISE return as a
predictor, in terms of error measures such as RMSE and MAE. This was confirmed by the paired Wilcoxon
signed-rank tests on the absolute residuals of the different prediction methods. Additionally, including time
as a covariate did not significantly change the goodness of prediction of the models, concurring with the
results of the exploratory analysis.
In the second benchmarking experiment, on the other hand, there was sufficient evidence for the opposite
conclusion: prediction models based only on prior ISE returns performed significantly better than models
based on the previous returns of all stock indices. This contradicts the exploratory analysis, in which no
significant linear association was found between ISE returns and its returns on preceding days, suggesting a
possible non-linear relationship not identified by the correlation tests.

Comparing the benchmarking experiments in Parts C and D, the error measures calculated in Part C tend to
be smaller than those in Part D. This might suggest that prediction models based on same-day data are better
at predicting ISE returns than models based only on recent previous data, indicating that same-day data
provides better information about, or is more closely associated with, the ISE returns.
Task 2: The Resistance of Constantin
Main Question
The data from the 8th edition of the "Rubber Bible" contains 16 data points giving the resistance of
Constantin wire at different diameters. The main task was to fit different regression models explaining
resistance in terms of diameter, and to investigate the goodness of fit of these models by obtaining estimates
of error measures such as RMSE and MAE.
A 5% significance level was applied to all analyses in this report.
Summary
Based on the investigation, regression models involving logarithmic or reciprocal transformations generally
had higher goodness of fit. The regression model on 1/d² was found to best explain the relationship between
resistance and diameter, producing the simplest model with high goodness of fit and the smallest residuals.

The model on 1/d² + 1/d performed similarly, but the covariate 1/d was found to be insignificant, while the
log-transformed model produced residuals larger than those of the 1/d² model. Meanwhile, fitting resistance
to a polynomial of degree 15 in diameter produced an over-fitted, rank-deficient model.
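The rank deficiency mentioned above arises because high powers of d on a narrow range are nearly collinear, so the normal-equations matrix loses full rank. As a toy two-predictor illustration (not the report's actual fit, which is done in SAS):

```python
def centered_gram_det(x1, x2):
    # determinant of the centered X'X matrix for two predictors;
    # a determinant of 0 means perfect collinearity, i.e. rank deficiency
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    return s11 * s22 - s12 ** 2
```

When the determinant is (numerically) zero, the least-squares solution is not unique, which is why SAS sets several of the polynomial coefficients to 0 in Table 2 below.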
Exploratory Analysis
Figure 1 (a) and (b): Histograms of both variables R and d.
Both the resistance and diameter variables take positive values below 1. From the histograms, both can be
seen to be positively skewed, with sample skewness of 2.9870 and 1.3166 respectively. To shrink the larger
values more than the smaller ones, a power or log transformation can be used to produce a more symmetric
distribution. However, as all the values lie between 0 and 1, a power transformation might not achieve the
desired effect.
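The skewness statistic quoted above can be sketched as follows; note this is the simple moment form, whereas SAS reports a bias-corrected version, so the two will differ slightly on small samples:

```python
def skewness(x):
    # moment-form sample skewness: m3 / m2^(3/2)
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5
```

Positive values indicate a long right tail, which is what motivates the log and reciprocal transformations considered below.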
From the scatter plot (Figure 2), we can see that resistance decreases as diameter increases: there is clearly a
decreasing, non-linear relationship between the two variables. Suggested transformations of the data include
logarithmic or reciprocal transformations, to straighten out the bivariate non-linear relationship.
Figure 2: Scatterplot of Resistance of Constantin wire and its diameter.
Regression Models
Four suggested regression models were fitted, namely:
model 1, log(R) = log(d);
model 2, R = d + d² + d³ + d⁴ + d⁵ + d⁶ + d⁷ + d⁸ + d⁹ + d¹⁰ + d¹¹ + d¹² + d¹³ + d¹⁴ + d¹⁵;
model 3, R = 1/d² + 1/d; and
model 4, R = 1/d².
For model 1, the fit plot is negatively sloped and linear, as the logarithmic transformation of both variables
gives negative values. Log(d) is a significant parameter, as its p-value is lower than 0.05, and the R-square of
1.0000 indicates the model is a good fit.

| Variable  | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|-----------|--------------------|----------------|---------|----------|
| Intercept | -9.68163           | 0.00659        | -1469.6 | <.0001   |
| Log(d)    | -1.99987           | 0.00208        | -963.01 | <.0001   |

Table 1: Parameter estimates for Model 1 (R-Square = 1.0000).

Meanwhile, model 2 is a rank-deficient least squares model. As such, the least-squares solutions for the
parameters are not unique, producing biased estimates and some misleading statistics.

| Variable  | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|-----------|--------------------|----------------|---------|----------|
| Intercept | 3.11362            | 0.21165        | 14.71   | <.0001   |
| d         | -417.05833         | 43.70357       | -9.54   | <.0001   |
| d2        | 23289              | 3277.49356     | 7.11    | 0.0004   |
| d3        | -685537            | 119274         | -5.75   | 0.0012   |
| d4        | 11541180           | 2347752        | 4.92    | 0.0027   |
| d5        | -113141376         | 25879463       | -4.37   | 0.0047   |
| d6        | 617364760          | 154399878      | 4.00    | 0.0071   |
| d7        | -1539663710        | 412457801      | -3.73   | 0.0097   |
| d8        | 0                  | .              | .       | .        |
| d9        | 5151630410         | 1517635765     | 3.39    | 0.0146   |
| d10       | 0                  | .              | .       | .        |
| d11       | 0                  | .              | .       | .        |
| d12       | 0                  | .              | .       | .        |
| d13       | 0                  | .              | .       | .        |
| d14       | -4.28733E11        | 1.406149E11    | -3.05   | 0.0225   |
| d15       | 0                  | .              | .       | .        |

Table 2: Parameter estimates for Model 2 (R-Square = 0.9944).

Table 2 shows that the model of resistance as a polynomial of degree 15 in diameter is likely over-fitted. Its
R-square value of 0.9944 is high only because R-squared never decreases when a predictor is added to a
model, so the high value does not show that this model is a good fit.

Model 3 appears to be a good fit, with an R-square of 1.0000. However, the p-value of 0.3798 for the
parameter 1/d suggests that this parameter is not significant in the model. Based on this, model 4 was fitted
with only 1/d², as a simpler model is generally preferred.

| Variable  | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|-----------|--------------------|----------------|---------|----------|
| Intercept | 0.00024745         | 0.00061146     | 0.40    | 0.6923   |
| 1/d2      | 0.00006279         | 2.538051E-7    | 247.39  | <.0001   |
| 1/d       | -0.00002805        | 0.00003085     | -0.91   | 0.3798   |

Table 3: Parameter estimates for Model 3 (R-Square = 1.0000).
| Variable  | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|-----------|--------------------|----------------|---------|----------|
| Intercept | -0.00020809        | 0.00034825     | -0.60   | 0.5597   |
| 1/d2      | 0.00006257         | 7.917185E-8    | 790.31  | <.0001   |

Table 4: Parameter estimates for Model 4 (R-Square = 1.0000).

In model 4, the transformed variable 1/d² is a significant parameter, with a p-value of less than 0.0001. The
fit plot is positively sloped and linear, with an R-square of 1.0000.

Based on the analysis so far, models 1 and 4 give the best fit of the four models. These results agree with the
potential models suggested by the exploratory findings. Further analysis is needed to establish a conclusion.

Cross-Validation

Leave-one-out cross-validation was carried out to test the goodness of fit of the models; the models are
compared below.

Figure 3: Residual plots for the 4 models fitted.

First, we consider the residual plots of all 4 models. As shown, models 1, 3 and 4 are relatively better fitted
than model 2. Model 2 has one extreme residual value of over 20000, which causes the large scale of its
residual plot; moreover, all its residuals are relatively large compared to those of the other 3 models.
Meanwhile, the residuals of model 1 are randomly scattered, but larger in value than those of models 3 and 4.
The residual plots for models 3 and 4 have the smallest scales among all the plots. This suggests that models
3 and 4 have the smallest residuals, and thus could be the better regression models. Next, we look at the
estimates obtained for the out-of-sample RMSE and MAE.
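The leave-one-out procedure refits each model 16 times, each time holding one data point out and predicting it from the rest. An illustrative sketch for a single-predictor model such as model 4 (the report's implementation is in SAS; the helper names here are hypothetical):

```python
def fit_ols(x, y):
    # ordinary least squares for y = a + b * x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def loocv_residuals(x, y):
    # leave-one-out CV: refit on n - 1 points, predict the held-out point
    out = []
    for i in range(len(x)):
        a, b = fit_ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append(y[i] - (a + b * x[i]))
    return out

# for model 4, regress R on the transformed predictor x = 1 / d**2
```

The out-of-sample RMSE and MAE in Table 5 are then computed from these held-out residuals rather than from the in-sample fit.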
Table 5: Error measures and their 95% confidence intervals for the different regression models.

| Model                                 | RMSE                          | MAE                     |
|---------------------------------------|-------------------------------|-------------------------|
| Suggested (i): Log(R)=log(d)          | 0.0093 (0.0092, 0.0094)       | 0.0065 (0.0049, 0.0081) |
| Part (b) (i): R = d + d² + … + d¹⁵    | 5358.40 (2764.46, 7952.34)    | 1340.41 (0.91, 2679.91) |
| Part (b) (ii): R = 1/d² + 1/d         | 0.0019 (0.0014, 0.0024)       | 0.0009 (0.0005, 0.0013) |
| Suggested (ii): R = 1/d²              | 0.0013 (0.0009, 0.0017)       | 0.0006 (0.0003, 0.0009) |

As expressed in Table 5, the second suggested model (R = 1/d²), which transforms diameter to the power of
-2, shows the lowest value for both measures of prediction goodness. However, the 95% confidence intervals
of the error measures for R = 1/d² and R = 1/d² + 1/d overlap, so we cannot yet conclude that model 4 is the
best model. Hence, paired Wilcoxon signed-rank tests were performed between all 4 models to confirm the
results above.

| Models                 | R = d + d² + … + d¹⁵    | R = 1/d² + 1/d          | R = 1/d²                |
|------------------------|-------------------------|-------------------------|-------------------------|
| Log(R)=log(d)          | S = 65, p-value = 0.0002 | S = 64, p-value = 0.0002 | S = 68, p-value < .0001 |
| R = d + d² + … + d¹⁵   |                         | S = 67, p-value < .0001 | S = 67, p-value < .0001 |
| R = 1/d² + 1/d         |                         |                         | S = 5, p-value = 0.8209 |

Table 6: Results of paired Wilcoxon signed-rank tests on absolute residuals.

The tests support the deduction that the performances of the models are significantly different from one
another, as most p-values are less than 0.05, indicating significant differences in the absolute residuals of the
regression models. The exception is the pair of the 4th (R = 1/d²) and 3rd (R = 1/d² + 1/d) models, with
p-value = 0.8209 suggesting no significant difference in performance. When two models perform similarly,
the model with fewer covariates is generally preferred; thus, in this case, the 4th model is the best model for
explaining resistance in terms of diameter.

Conclusion

Based on the investigation, Model 2 (R = d + d² + d³ + d⁴ + d⁵ + d⁶ + d⁷ + d⁸ + d⁹ + d¹⁰ + d¹¹ + d¹² + d¹³ +
d¹⁴ + d¹⁵) is an extreme example of fitting an overly complicated model to obtain a good fit: the model is too
complex for the data, even though it appears to explain a lot of the variation in the response variable. Model
1 (Log(R)=log(d)) is relatively good, but it does not have the lowest RMSE and MAE, suggesting that the
residuals it produces are relatively large. Meanwhile, Model 3 (R = 1/d² + 1/d) has one insignificant
covariate, which leads to the second suggested model.

In conclusion, Model 4 (R = 1/d²) is the best model for explaining the resistance of Constantin wire in terms
of varying diameter, producing the simplest model with high goodness of fit and the smallest residuals, as
evidenced by the high R-squared value and the low RMSE and MAE error measures. The model can be
interpreted as follows: as the diameter of the wire decreases, the square of the diameter decreases, so the
reciprocal of the squared diameter increases, and hence the predicted resistance increases.
Appendix A: Task 1 R Code
# Part (a). Data load and conversion of the date column.
# Load data from CSV file.
ISE_data=read.csv(file="C:/Documents/STAT7001/Istanbul.csv",
header=TRUE, sep=",")
# Convert the date column into a recognisable date format in R.
ISE_data$date=as.POSIXct(ISE_data$date, format="%d-%b-%Y")
# Find the difference in numbers of days, and round off any decimals.
ISE_data$date<-difftime(ISE_data$date,ISE_data$date[1], units="days")
ISE_data$date<-round(ISE_data$date,digits=0)
ISE_data$date=as.numeric(as.character(ISE_data$date))
# Part (b). Exploratory data analysis.
# Association between index and time.
plot(ISE_data[,1], ISE_data[,2], xlab="Days", ylab="ISE",
abline(lm(ISE~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,3], xlab="Days", ylab="S&P 500",
abline(lm(S.P.500~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,4], xlab="Days", ylab="DAX",
abline(lm(DAX~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,5], xlab="Days", ylab="FTSE 100",
abline(lm(FTSE~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,6], xlab="Days", ylab="Nikkei 225",
abline(lm(NIKKEI~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,7], xlab="Days", ylab="Ibovespa",
abline(lm(BOVESPA~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,8], xlab="Days", ylab="MSCI EU Index",
abline(lm(MSCI.EU~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,9], xlab="Days", ylab="MSCI EM Index",
abline(lm(MSCI.EM~date, ISE_data)))
cor.test(ISE_data[,1], ISE_data[,2])
cor.test(ISE_data[,1], ISE_data[,3])
cor.test(ISE_data[,1], ISE_data[,4])
cor.test(ISE_data[,1], ISE_data[,5])
cor.test(ISE_data[,1], ISE_data[,6])
cor.test(ISE_data[,1], ISE_data[,7])
cor.test(ISE_data[,1], ISE_data[,8])
cor.test(ISE_data[,1], ISE_data[,9])
# Association between ISE index and index the days before.
plot(ISE_data[c(2:536),2], ISE_data[c(1:535),2], xlab="ISE, Day N", ylab="ISE, Day N-1")
plot(ISE_data[c(3:536),2], ISE_data[c(1:534),2], xlab="ISE, Day N", ylab="ISE, Day N-2")
plot(ISE_data[c(4:536),2], ISE_data[c(1:533),2], xlab="ISE, Day N", ylab="ISE, Day N-3")
cor.test(ISE_data[c(2:536),2], ISE_data[c(1:535),2])
cor.test(ISE_data[c(3:536),2], ISE_data[c(1:534),2])
cor.test(ISE_data[c(4:536),2], ISE_data[c(1:533),2])
# Association between S&P 500 index and index the days before.
plot(ISE_data[c(2:536),3], ISE_data[c(1:535),3],
xlab="S&P 500, Day N", ylab="S&P 500, Day N-1")
plot(ISE_data[c(3:536),3], ISE_data[c(1:534),3],
xlab="S&P 500, Day N", ylab="S&P 500, Day N-2")
plot(ISE_data[c(4:536),3], ISE_data[c(1:533),3],
xlab="S&P 500, Day N", ylab="S&P 500, Day N-3")
cor.test(ISE_data[c(2:536),3], ISE_data[c(1:535),3])
cor.test(ISE_data[c(3:536),3], ISE_data[c(1:534),3])
cor.test(ISE_data[c(4:536),3], ISE_data[c(1:533),3])
# Association between DAX index and index the days before.
plot(ISE_data[c(2:536),4], ISE_data[c(1:535),4], xlab="DAX, Day N", ylab="DAX, Day N-1")
plot(ISE_data[c(3:536),4], ISE_data[c(1:534),4], xlab="DAX, Day N", ylab="DAX, Day N-2")
plot(ISE_data[c(4:536),4], ISE_data[c(1:533),4], xlab="DAX, Day N", ylab="DAX, Day N-3")
cor.test(ISE_data[c(2:536),4], ISE_data[c(1:535),4])
cor.test(ISE_data[c(3:536),4], ISE_data[c(1:534),4])
cor.test(ISE_data[c(4:536),4], ISE_data[c(1:533),4])
# Association between FTSE 100 index and index the days before.
plot(ISE_data[c(2:536),5], ISE_data[c(1:535),5],
xlab="FTSE 100, Day N", ylab="FTSE 100, Day N-1")
plot(ISE_data[c(3:536),5], ISE_data[c(1:534),5],
xlab="FTSE 100, Day N", ylab="FTSE 100, Day N-2")
plot(ISE_data[c(4:536),5], ISE_data[c(1:533),5],
xlab="FTSE 100, Day N", ylab="FTSE 100, Day N-3")
cor.test(ISE_data[c(2:536),5], ISE_data[c(1:535),5])
cor.test(ISE_data[c(3:536),5], ISE_data[c(1:534),5])
cor.test(ISE_data[c(4:536),5], ISE_data[c(1:533),5])
# Association between Nikkei 225 index and index the days before.
plot(ISE_data[c(2:536),6], ISE_data[c(1:535),6],
xlab="Nikkei 225, Day N", ylab="Nikkei 225, Day N-1")
plot(ISE_data[c(3:536),6], ISE_data[c(1:534),6],
xlab="Nikkei 225, Day N", ylab="Nikkei 225, Day N-2")
plot(ISE_data[c(4:536),6], ISE_data[c(1:533),6],
xlab="Nikkei 225, Day N", ylab="Nikkei 225, Day N-3")
cor.test(ISE_data[c(2:536),6], ISE_data[c(1:535),6])
cor.test(ISE_data[c(3:536),6], ISE_data[c(1:534),6])
cor.test(ISE_data[c(4:536),6], ISE_data[c(1:533),6])
# Association between Ibovespa index and index the days before.
plot(ISE_data[c(2:536),7], ISE_data[c(1:535),7],
xlab="Ibovespa, Day N", ylab="Ibovespa, Day N-1")
plot(ISE_data[c(3:536),7], ISE_data[c(1:534),7],
xlab="Ibovespa, Day N", ylab="Ibovespa, Day N-2")
plot(ISE_data[c(4:536),7], ISE_data[c(1:533),7],
xlab="Ibovespa, Day N", ylab="Ibovespa, Day N-3")
cor.test(ISE_data[c(2:536),7], ISE_data[c(1:535),7])
cor.test(ISE_data[c(3:536),7], ISE_data[c(1:534),7])
cor.test(ISE_data[c(4:536),7], ISE_data[c(1:533),7])
# Association between MSCI EU index and index the days before.
plot(ISE_data[c(2:536),8], ISE_data[c(1:535),8],
xlab="MSCI EU, Day N", ylab="MSCI EU, Day N-1")
plot(ISE_data[c(3:536),8], ISE_data[c(1:534),8],
xlab="MSCI EU, Day N", ylab="MSCI EU, Day N-2")
plot(ISE_data[c(4:536),8], ISE_data[c(1:533),8],
xlab="MSCI EU, Day N", ylab="MSCI EU, Day N-3")
cor.test(ISE_data[c(2:536),8], ISE_data[c(1:535),8])
cor.test(ISE_data[c(3:536),8], ISE_data[c(1:534),8])
cor.test(ISE_data[c(4:536),8], ISE_data[c(1:533),8])
# Association between MSCI EM index and index the days before.
plot(ISE_data[c(2:536),9], ISE_data[c(1:535),9],
xlab="MSCI EM, Day N", ylab="MSCI EM, Day N-1")
plot(ISE_data[c(3:536),9], ISE_data[c(1:534),9],
xlab="MSCI EM, Day N", ylab="MSCI EM, Day N-2")
plot(ISE_data[c(4:536),9], ISE_data[c(1:533),9],
xlab="MSCI EM, Day N", ylab="MSCI EM, Day N-3")
cor.test(ISE_data[c(2:536),9], ISE_data[c(1:535),9])
cor.test(ISE_data[c(3:536),9], ISE_data[c(1:534),9])
cor.test(ISE_data[c(4:536),9], ISE_data[c(1:533),9])
# Part (c). Benchmarking with all data.
# ----------------------------------------------------------------------------
# Creating functions for measures of prediction goodness and their std errors.
# ----------------------------------------------------------------------------
# (i) Root mean squared error (RMSE)
rmse=function(observed, fitted){
sqrt(mean((observed-fitted)^2))
}
rmseSE=function(observed, fitted){
sd((observed-fitted)^2)/sqrt(length(observed))/(2*sqrt(mean((observed-fitted)^2)))
}
# (ii) Mean absolute error (MAE)
mae=function(observed, fitted){
mean(abs(observed-fitted))
}
maeSE=function(observed, fitted){
sd(abs(observed-fitted))/sqrt(length(observed))
}
# (iii) Relative RMSE
RELrmse=function(observed, fitted){
sqrt(mean(((observed-fitted)/observed)^2))
}
RELrmseSE=function(observed, fitted){
sd(((observed-fitted)/observed)^2)/sqrt(length(observed))/
(2*sqrt(mean(((observed-fitted)/observed)^2)))
}
# (iv) Relative MAE
RELmae=function(observed, fitted){
mean(abs((observed-fitted)/observed))
}
RELmaeSE=function(observed, fitted){
sd(abs((observed-fitted)/observed))/sqrt(length(observed))
}
# ---------------------------------------------------------------------------------
# Comparison of prediction methods, using validation set-up (i).
# i.e. Chronologically first 80% of data (428.8 or 429 entries) as training sample;
# remaining data as test sample.
# ---------------------------------------------------------------------------------
# Prediction method (i): Mean
# -- Predictor
Chr.ISEmean=mean(ISE_data$ISE[c(1:429)])
# -- Predicted values
Chr.ISEmean
# -- Error measures
Chr.mean.rmse = rmse(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.rmseSE = rmseSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.rmse-1.96*Chr.mean.rmseSE; Chr.mean.rmse+1.96*Chr.mean.rmseSE
Chr.mean.mae = mae(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.maeSE = maeSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.mae-1.96*Chr.mean.maeSE; Chr.mean.mae+1.96*Chr.mean.maeSE
Chr.mean.RELrmse = RELrmse(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELrmseSE = RELrmseSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELrmse-1.96*Chr.mean.RELrmseSE; Chr.mean.RELrmse+1.96*Chr.mean.RELrmseSE
Chr.mean.RELmae = RELmae(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELmaeSE = RELmaeSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELmae-1.96*Chr.mean.RELmaeSE; Chr.mean.RELmae+1.96*Chr.mean.RELmaeSE
# ---------------------------------------------------------------
# Comparison of prediction methods, using validation set-up (ii).
# i.e. Five-fold cross-validation with uniformly randomly sampled folds.
# ---------------------------------------------------------------
# Five-fold cross-validation data setup.
# Create random permutation of values.
set.seed(555)
randperm=sample(nrow(ISE_data))
# Create lists with test folds and their respective training folds.
trainfolds=list()
testfolds=list()
for(i in 1:5){
lower=floor((i-1)*nrow(ISE_data)/5)+1
upper=floor(i*nrow(ISE_data)/5)
testfolds[[i]]=randperm[lower:upper]
trainfolds[[i]]=setdiff(1:nrow(ISE_data),testfolds[[i]])
testfolds[[i]]=ISE_data[testfolds[[i]],]
trainfolds[[i]]=ISE_data[trainfolds[[i]],]
}
# ---------------------------------------------------------------
# Prediction method (i): Mean
# -- Predictor
Fol.ISEmean=list()
for(i in 1:5){
Fol.ISEmean[[i]]=mean(trainfolds[[i]][[2]])
}
# -- Predicted values
Fol.ISEmean
# -- Error measures
# *** RMSE ***
Fol.mean.rmse=list()
for(i in 1:5){
Fol.mean.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.rmse=mean(as.numeric(Fol.mean.rmse))
# Standard Error
Fol.mean.rmseSE=list()
for(i in 1:5){
Fol.mean.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.rmseSE=mean(as.numeric(Fol.mean.rmseSE))
# Confidence Interval
Fol.mean.rmse-1.96*Fol.mean.rmseSE; Fol.mean.rmse+1.96*Fol.mean.rmseSE
# *** MAE ***
Fol.mean.mae=list()
for(i in 1:5){
Fol.mean.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.mae=mean(as.numeric(Fol.mean.mae))
# Standard Error
Fol.mean.maeSE=list()
for(i in 1:5){
Fol.mean.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.maeSE=mean(as.numeric(Fol.mean.maeSE))
# Confidence Interval
Fol.mean.mae-1.96*Fol.mean.maeSE; Fol.mean.mae+1.96*Fol.mean.maeSE
# *** Relative RMSE ***
Fol.mean.RELrmse=list()
for(i in 1:5){
Fol.mean.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.RELrmse=mean(as.numeric(Fol.mean.RELrmse))
# Standard Error
Fol.mean.RELrmseSE=list()
for(i in 1:5){
Fol.mean.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.RELrmseSE=mean(as.numeric(Fol.mean.RELrmseSE))
# Confidence Interval
Fol.mean.RELrmse-1.96*Fol.mean.RELrmseSE; Fol.mean.RELrmse+1.96*Fol.mean.RELrmseSE
# *** Relative MAE ***
Fol.mean.RELmae=list()
for(i in 1:5){
Fol.mean.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.RELmae=mean(as.numeric(Fol.mean.RELmae))
# Standard Error
Fol.mean.RELmaeSE=list()
for(i in 1:5){
Fol.mean.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]])
}
Fol.mean.RELmaeSE=mean(as.numeric(Fol.mean.RELmaeSE))
# Confidence Interval
Fol.mean.RELmae-1.96*Fol.mean.RELmaeSE; Fol.mean.RELmae+1.96*Fol.mean.RELmaeSE
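Every error measure above follows the same pattern: apply the metric to each test fold, then average the five results. A generic helper along the following lines could replace the repeated loops (`cv_metric` is a name introduced here for illustration, not part of the original script):

```r
# Fold-averaged metric: applies metric(observed, fitted) to each test fold
# and averages the results across folds.
cv_metric <- function(metric, folds, preds) {
  vals <- sapply(seq_along(folds),
                 function(i) metric(folds[[i]]$ISE, preds[[i]]))
  mean(vals)
}
```

For example, `Fol.mean.rmse` could equivalently be computed as `cv_metric(rmse, testfolds, Fol.ISEmean)`, since a scalar prediction is recycled against the fold's observations.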
# ---------------------------------------------------------------
# Prediction method (ii): Linear model excluding time.
# -- Models
Fol.LMnoTime=list()
for(i in 1:5){
Fol.LMnoTime[[i]]=lm(ISE~S.P.500 + DAX + FTSE + NIKKEI + BOVESPA + MSCI.EU + MSCI.EM,
data=trainfolds[[i]])
}
# -- Predicted values
Fol.LMnoTime.Pred=list()
for(i in 1:5){
Fol.LMnoTime.Pred[[i]]=predict(Fol.LMnoTime[[i]], testfolds[[i]])
}
# -- Error measures
# *** RMSE ***
Fol.LMnoTime.rmse=list()
for(i in 1:5){
Fol.LMnoTime.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.rmse=mean(as.numeric(Fol.LMnoTime.rmse))
# Standard Error
Fol.LMnoTime.rmseSE=list()
for(i in 1:5){
Fol.LMnoTime.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.rmseSE=mean(as.numeric(Fol.LMnoTime.rmseSE))
# Confidence Interval
Fol.LMnoTime.rmse-1.96*Fol.LMnoTime.rmseSE; Fol.LMnoTime.rmse+1.96*Fol.LMnoTime.rmseSE
# *** MAE ***
Fol.LMnoTime.mae=list()
for(i in 1:5){
Fol.LMnoTime.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.mae=mean(as.numeric(Fol.LMnoTime.mae))
# Standard Error
Fol.LMnoTime.maeSE=list()
for(i in 1:5){
Fol.LMnoTime.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.maeSE=mean(as.numeric(Fol.LMnoTime.maeSE))
# Confidence Interval
Fol.LMnoTime.mae-1.96*Fol.LMnoTime.maeSE; Fol.LMnoTime.mae+1.96*Fol.LMnoTime.maeSE
# *** Relative RMSE ***
Fol.LMnoTime.RELrmse=list()
for(i in 1:5){
Fol.LMnoTime.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.RELrmse=mean(as.numeric(Fol.LMnoTime.RELrmse))
# Standard Error
Fol.LMnoTime.RELrmseSE=list()
for(i in 1:5){
Fol.LMnoTime.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.RELrmseSE=mean(as.numeric(Fol.LMnoTime.RELrmseSE))
# Confidence Interval
Fol.LMnoTime.RELrmse-1.96*Fol.LMnoTime.RELrmseSE;
Fol.LMnoTime.RELrmse+1.96*Fol.LMnoTime.RELrmseSE
# *** Relative MAE ***
Fol.LMnoTime.RELmae=list()
for(i in 1:5){
Fol.LMnoTime.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.RELmae=mean(as.numeric(Fol.LMnoTime.RELmae))
# Standard Error
Fol.LMnoTime.RELmaeSE=list()
for(i in 1:5){
Fol.LMnoTime.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]])
}
Fol.LMnoTime.RELmaeSE=mean(as.numeric(Fol.LMnoTime.RELmaeSE))
# Confidence Interval
Fol.LMnoTime.RELmae-1.96*Fol.LMnoTime.RELmaeSE;
Fol.LMnoTime.RELmae+1.96*Fol.LMnoTime.RELmaeSE
# ---------------------------------------------------------------
# Prediction method (iii): Linear model including time.
# -- Models
Fol.LMwithTime=list()
for(i in 1:5){
Fol.LMwithTime[[i]]=lm(ISE~date+S.P.500+DAX+FTSE+NIKKEI+BOVESPA+MSCI.EU+MSCI.EM,
data=trainfolds[[i]])
}
# -- Predicted values
Fol.LMwithTime.Pred=list()
for(i in 1:5){
Fol.LMwithTime.Pred[[i]]=predict(Fol.LMwithTime[[i]], testfolds[[i]])
}
# -- Error measures
# *** RMSE ***
Fol.LMwithTime.rmse=list()
for(i in 1:5){
Fol.LMwithTime.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.rmse=mean(as.numeric(Fol.LMwithTime.rmse))
# Standard Error
Fol.LMwithTime.rmseSE=list()
for(i in 1:5){
Fol.LMwithTime.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.rmseSE=mean(as.numeric(Fol.LMwithTime.rmseSE))
# Confidence Interval
Fol.LMwithTime.rmse-1.96*Fol.LMwithTime.rmseSE;
Fol.LMwithTime.rmse+1.96*Fol.LMwithTime.rmseSE
# *** MAE ***
Fol.LMwithTime.mae=list()
for(i in 1:5){
Fol.LMwithTime.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.mae=mean(as.numeric(Fol.LMwithTime.mae))
# Standard Error
Fol.LMwithTime.maeSE=list()
for(i in 1:5){
Fol.LMwithTime.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.maeSE=mean(as.numeric(Fol.LMwithTime.maeSE))
# Confidence Interval
Fol.LMwithTime.mae-1.96*Fol.LMwithTime.maeSE;
Fol.LMwithTime.mae+1.96*Fol.LMwithTime.maeSE
# *** Relative RMSE ***
Fol.LMwithTime.RELrmse=list()
for(i in 1:5){
Fol.LMwithTime.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.RELrmse=mean(as.numeric(Fol.LMwithTime.RELrmse))
# Standard Error
Fol.LMwithTime.RELrmseSE=list()
for(i in 1:5){
Fol.LMwithTime.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.RELrmseSE=mean(as.numeric(Fol.LMwithTime.RELrmseSE))
# Confidence Interval
Fol.LMwithTime.RELrmse-1.96*Fol.LMwithTime.RELrmseSE;
Fol.LMwithTime.RELrmse+1.96*Fol.LMwithTime.RELrmseSE
# *** Relative MAE ***
Fol.LMwithTime.RELmae=list()
for(i in 1:5){
Fol.LMwithTime.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.RELmae=mean(as.numeric(Fol.LMwithTime.RELmae))
# Standard Error
Fol.LMwithTime.RELmaeSE=list()
for(i in 1:5){
Fol.LMwithTime.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]])
}
Fol.LMwithTime.RELmaeSE=mean(as.numeric(Fol.LMwithTime.RELmaeSE))
# Confidence Interval
Fol.LMwithTime.RELmae-1.96*Fol.LMwithTime.RELmaeSE;
Fol.LMwithTime.RELmae+1.96*Fol.LMwithTime.RELmaeSE
# ---------------------------------------------------------------
# Comparison of prediction methods.
# Vector of residuals for prediction method (i).
Fol.ISEmean.resid=list()
for(i in 1:5){
Fol.ISEmean.resid[[i]]=testfolds[[i]]$ISE-Fol.ISEmean[[i]]
}
Fol.ISEmean.resid=unlist(Fol.ISEmean.resid)
# Vector of residuals for prediction method (ii).
Fol.LMnoTime.resid=list()
for(i in 1:5){
Fol.LMnoTime.resid[[i]]=testfolds[[i]]$ISE-Fol.LMnoTime.Pred[[i]]
}
Fol.LMnoTime.resid=unlist(Fol.LMnoTime.resid)
# Vector of residuals for prediction method (iii).
Fol.LMwithTime.resid=list()
for(i in 1:5){
Fol.LMwithTime.resid[[i]]=testfolds[[i]]$ISE-Fol.LMwithTime.Pred[[i]]
}
Fol.LMwithTime.resid=unlist(Fol.LMwithTime.resid)
# Test for comparison of prediction methods.
wilcox.test(abs(Fol.ISEmean.resid), abs(Fol.LMnoTime.resid), paired=TRUE)
wilcox.test(abs(Fol.ISEmean.resid), abs(Fol.LMwithTime.resid), paired=TRUE)
wilcox.test(abs(Fol.LMnoTime.resid), abs(Fol.LMwithTime.resid), paired=TRUE)
# Part (d). Benchmarking with previous data.
#create a vector of errors for RMSE and MAE in (i)
ISE.error1=vector(mode="numeric", length=526)
result.index=0
for(n in 11:536){
result.index=result.index+1
error1=ISE_data[n,2]-ISE_data[n-1,2]
ISE.error1[result.index]=error1
}
#calculate RMSE for (i)
(RMSE1=sqrt(mean(ISE.error1^2)))
#calculate MAE for (i)
(MAE1=mean(abs(ISE.error1)))
#calculate standard error of RMSE for (i)
(SE.RMSE1=(sd(ISE.error1^2)/sqrt(526))/(2*sqrt(mean(ISE.error1^2))))
#calculate standard error of MAE for (i)
(SE.MAE1=sd(abs(ISE.error1))/sqrt(526))
#95% confidence interval for RMSE
RMSE1-1.96*SE.RMSE1; RMSE1+1.96*SE.RMSE1
#95% confidence interval for MAE
MAE1-1.96*SE.MAE1; MAE1+1.96*SE.MAE1
#create a vector of errors for relative RMSE and relative MAE in (i)
ISE.rerror1=ISE.error1/ISE_data[c(11:536),2]
#calculate relative RMSE for (i)
(rRMSE1=sqrt(mean(ISE.rerror1^2)))
#calculate relative MAE for (i)
(rMAE1=mean(abs(ISE.rerror1)))
#calculate standard error of relative RMSE for (i)
(SE.rRMSE1=(sd(ISE.rerror1^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror1^2))))
#calculate standard error of relative MAE for (i)
(SE.rMAE1=sd(abs(ISE.rerror1))/sqrt(526))
#95% confidence interval for relative RMSE
rRMSE1-1.96*SE.rRMSE1; rRMSE1+1.96*SE.rRMSE1
#95% confidence interval for relative MAE
rMAE1-1.96*SE.rMAE1; rMAE1+1.96*SE.rMAE1
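The error loop for method (i) above implements a one-step "persistence" predictor (the previous day's return predicts today's), so the errors are just first differences of the series. A minimal sketch, with a toy vector standing in for `ISE_data$ISE`:

```r
# Vectorised persistence errors: error[n] = series[n] - series[n-1],
# equivalent to the loop above (and to diff()).
toy <- c(0.010, -0.020, 0.005, 0.030)    # toy daily returns
pers_err <- toy[-1] - toy[-length(toy)]  # one error per day after the first
```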
############
#create a vector of errors for RMSE and MAE in (ii)
ISE.error2=vector(mode="numeric", length=526)
result.index=0
for(n in 11:536){
result.index=result.index+1
error2=ISE_data[n,2]-mean(ISE_data[c((n-5):(n-1)),2])
ISE.error2[result.index]=error2
}
#calculate RMSE for (ii)
(RMSE2=sqrt(mean(ISE.error2^2)))
#calculate MAE for (ii)
(MAE2=mean(abs(ISE.error2)))
#calculate standard error of RMSE for (ii)
(SE.RMSE2=(sd(ISE.error2^2)/sqrt(526))/(2*sqrt(mean(ISE.error2^2))))
#calculate standard error of MAE for (ii)
(SE.MAE2=sd(abs(ISE.error2))/sqrt(526))
#95% confidence interval for RMSE
RMSE2-1.96*SE.RMSE2; RMSE2+1.96*SE.RMSE2
#95% confidence interval for MAE
MAE2-1.96*SE.MAE2; MAE2+1.96*SE.MAE2
#create a vector of errors for relative RMSE and relative MAE in (ii)
ISE.rerror2=ISE.error2/ISE_data[c(11:536),2]
#calculate relative RMSE for (ii)
(rRMSE2=sqrt(mean(ISE.rerror2^2)))
#calculate relative MAE for (ii)
(rMAE2=mean(abs(ISE.rerror2)))
#calculate standard error of relative RMSE for (ii)
(SE.rRMSE2=(sd(ISE.rerror2^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror2^2))))
#calculate standard error of relative MAE for (ii)
(SE.rMAE2=sd(abs(ISE.rerror2))/sqrt(526))
#95% confidence interval for relative RMSE
rRMSE2-1.96*SE.rRMSE2; rRMSE2+1.96*SE.rRMSE2
#95% confidence interval for relative MAE
rMAE2-1.96*SE.rMAE2; rMAE2+1.96*SE.rMAE2
#################################################################################
#create a vector of errors for RMSE and MAE in (iii)
ISE_data.iii=ISE_data[-536,]
ISE_data.iii$ISE.predicted=ISE_data$ISE[2:536]
ISE.error3=vector(mode="numeric", length=526)
result.index=0
for(n in 10:535){
result.index=result.index+1
lmmodel3=lm(ISE.predicted~ISE+S.P.500+DAX+FTSE+NIKKEI+BOVESPA+MSCI.EU+MSCI.EM,
data=ISE_data.iii[(n-9):(n-1),])
error3=ISE_data.iii[n,10]-predict(lmmodel3, ISE_data.iii[n,])
ISE.error3[result.index]=error3
}
#calculate RMSE for (iii)
(RMSE3=sqrt(mean(ISE.error3^2)))
#calculate MAE for (iii)
(MAE3=mean(abs(ISE.error3)))
#calculate standard error of RMSE for (iii)
(SE.RMSE3=(sd(ISE.error3^2)/sqrt(526))/(2*sqrt(mean(ISE.error3^2))))
#calculate standard error of MAE for (iii)
(SE.MAE3=sd(abs(ISE.error3))/sqrt(526))
#95% confidence interval for RMSE
RMSE3-1.96*SE.RMSE3; RMSE3+1.96*SE.RMSE3
#95% confidence interval for MAE
MAE3-1.96*SE.MAE3; MAE3+1.96*SE.MAE3
#create a vector of errors for relative RMSE and relative MAE in (iii)
ISE.rerror3=ISE.error3/ISE_data.iii[c(10:535),10]
#calculate relative RMSE for (iii)
(rRMSE3=sqrt(mean(ISE.rerror3^2)))
#calculate relative MAE for (iii)
(rMAE3=mean(abs(ISE.rerror3)))
#calculate standard error of relative RMSE for (iii)
(SE.rRMSE3=(sd(ISE.rerror3^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror3^2))))
#calculate standard error of relative MAE for (iii)
(SE.rMAE3=sd(abs(ISE.rerror3))/sqrt(526))
#95% confidence interval for relative RMSE
rRMSE3-1.96*SE.rRMSE3; rRMSE3+1.96*SE.rRMSE3
#95% confidence interval for relative MAE
rMAE3-1.96*SE.rMAE3; rMAE3+1.96*SE.rMAE3
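The loop for method (iii) refits a linear model on a sliding window of the preceding rows and predicts one step ahead. That scheme can be captured by a generic helper (`roll_forecast` is a name introduced here for illustration; the original loop uses a window of 9 rows):

```r
# Rolling-window one-step-ahead forecaster: for each row n, fits
# lm(formula) on the w rows before n and predicts row n.
roll_forecast <- function(data, formula, w) {
  n_all <- nrow(data)
  sapply((w + 1):n_all, function(n) {
    fit <- lm(formula, data = data[(n - w):(n - 1), ])
    predict(fit, newdata = data[n, , drop = FALSE])
  })
}
```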
########################################################################################
#create a vector of errors for RMSE and MAE in (iv)
ISE_data.iv=ISE_data[-c(535,536),]
ISE_data.extracted=ISE_data[-c(1,536),-1]
ISE_data.iv=cbind(ISE_data.iv,ISE_data.extracted)
ISE_data.iv$ISE.predicted=ISE_data[-c(1,2),2]
names(ISE_data.iv)=c("date","ISE2","S.P.5002","DAX2","FTSE2",
"NIKKEI2","BOVESPA2","MSCI.EU2","MSCI.EM2",
"ISE1","S.P.5001","DAX1","FTSE1",
"NIKKEI1","BOVESPA1","MSCI.EU1","MSCI.EM1","ISE.predicted")
ISE.error4=vector(mode="numeric", length=526)
result.index=0
for(n in 9:534){
result.index=result.index+1
lmmodel4=lm(ISE.predicted~ISE2+S.P.5002+DAX2+FTSE2+NIKKEI2+BOVESPA2+MSCI.EU2+MSCI.EM2
+ISE1+S.P.5001+DAX1+FTSE1+NIKKEI1+BOVESPA1+MSCI.EU1+MSCI.EM1,
data=ISE_data.iv[(n-8):(n-1),])
error4=ISE_data.iv[n,18]-predict(lmmodel4, ISE_data.iv[n,])
ISE.error4[result.index]=error4
}
#calculate RMSE for (iv)
(RMSE4=sqrt(mean(ISE.error4^2)))
#calculate MAE for (iv)
(MAE4=mean(abs(ISE.error4)))
#calculate standard error of RMSE for (iv)
(SE.RMSE4=(sd(ISE.error4^2)/sqrt(526))/(2*sqrt(mean(ISE.error4^2))))
#calculate standard error of MAE for (iv)
(SE.MAE4=sd(abs(ISE.error4))/sqrt(526))
#95% confidence interval for RMSE
RMSE4-1.96*SE.RMSE4; RMSE4+1.96*SE.RMSE4
#95% confidence interval for MAE
MAE4-1.96*SE.MAE4; MAE4+1.96*SE.MAE4
#########
#create a vector of errors for relative RMSE and relative MAE in (iv)
ISE.rerror4=ISE.error4/ISE_data.iv[c(9:534),18]
#calculate relative RMSE for (iv)
(rRMSE4=sqrt(mean(ISE.rerror4^2)))
#calculate relative MAE for (iv)
(rMAE4=mean(abs(ISE.rerror4)))
#calculate standard error of relative RMSE for (iv)
(SE.rRMSE4=(sd(ISE.rerror4^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror4^2))))
#calculate standard error of relative MAE for (iv)
(SE.rMAE4=sd(abs(ISE.rerror4))/sqrt(526))
#95% confidence interval for relative RMSE
rRMSE4-1.96*SE.rRMSE4; rRMSE4+1.96*SE.rRMSE4
#95% confidence interval for relative MAE
rMAE4-1.96*SE.rMAE4; rMAE4+1.96*SE.rMAE4
#######################################################################################
#Wilcoxon tests to compare the 4 prediction methods from part (d)
wilcox.test(abs(ISE.error1),abs(ISE.error2), paired=TRUE)
wilcox.test(abs(ISE.error1),abs(ISE.error3), paired=TRUE)
wilcox.test(abs(ISE.error1),abs(ISE.error4), paired=TRUE)
wilcox.test(abs(ISE.error2),abs(ISE.error3), paired=TRUE)
wilcox.test(abs(ISE.error2),abs(ISE.error4), paired=TRUE)
wilcox.test(abs(ISE.error3),abs(ISE.error4), paired=TRUE)
# *** RMSE ***
Fol.PartE.rmse=list()
for(i in 1:5){
Fol.PartE.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.rmse=mean(as.numeric(Fol.PartE.rmse))
# Standard Error
Fol.PartE.rmseSE=list()
for(i in 1:5){
Fol.PartE.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.rmseSE=mean(as.numeric(Fol.PartE.rmseSE))
# Confidence Interval
Fol.PartE.rmse-1.96*Fol.PartE.rmseSE; Fol.PartE.rmse+1.96*Fol.PartE.rmseSE
# *** MAE ***
Fol.PartE.mae=list()
for(i in 1:5){
Fol.PartE.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.mae=mean(as.numeric(Fol.PartE.mae))
# Standard Error
Fol.PartE.maeSE=list()
for(i in 1:5){
Fol.PartE.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.maeSE=mean(as.numeric(Fol.PartE.maeSE))
# Confidence Interval
Fol.PartE.mae-1.96*Fol.PartE.maeSE; Fol.PartE.mae+1.96*Fol.PartE.maeSE
# *** Relative RMSE ***
Fol.PartE.RELrmse=list()
for(i in 1:5){
Fol.PartE.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.RELrmse=mean(as.numeric(Fol.PartE.RELrmse))
# Standard Error
Fol.PartE.RELrmseSE=list()
for(i in 1:5){
Fol.PartE.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.RELrmseSE=mean(as.numeric(Fol.PartE.RELrmseSE))
# Confidence Interval
Fol.PartE.RELrmse-1.96*Fol.PartE.RELrmseSE; Fol.PartE.RELrmse+1.96*Fol.PartE.RELrmseSE
# *** Relative MAE ***
Fol.PartE.RELmae=list()
for(i in 1:5){
Fol.PartE.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.RELmae=mean(as.numeric(Fol.PartE.RELmae))
# Standard Error
Fol.PartE.RELmaeSE=list()
for(i in 1:5){
Fol.PartE.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]])
}
Fol.PartE.RELmaeSE=mean(as.numeric(Fol.PartE.RELmaeSE))
# Confidence Interval
Fol.PartE.RELmae-1.96*Fol.PartE.RELmaeSE; Fol.PartE.RELmae+1.96*Fol.PartE.RELmaeSE
# Comparison of prediction methods.
# Vector of residuals for prediction method (iv).
Fol.PartE.resid=list()
for(i in 1:5){
Fol.PartE.resid[[i]]=testfolds[[i]]$ISE - Fol.PartE.Pred[[i]]
}
Fol.PartE.resid=unlist(Fol.PartE.resid)
# Test for comparison of prediction methods.
wilcox.test(abs(Fol.ISEmean.resid), abs(Fol.PartE.resid), paired=TRUE)
wilcox.test(abs(Fol.LMnoTime.resid), abs(Fol.PartE.resid), paired=TRUE)
wilcox.test(abs(Fol.LMwithTime.resid), abs(Fol.PartE.resid), paired=TRUE)
# Part (e) with the validation setup of Part (d): robust linear regression
# (least absolute deviations, fitted by minimising the sum of absolute residuals).
#create a vector of errors for RMSE and MAE for 526 data splits
ISE.error5=vector(mode="numeric", length=526)
result.index=0
for(n in 10:535){
result.index=result.index+1
Sum.residuals=function(be,x,y){
res=be%*%t(x)
SAR=sum(abs(res-y))
return(SAR)
}
beta=nlm(Sum.residuals, p=c(10,10,10,10,10,10,10,10),
x=ISE_data.iii[(n-9):(n-1),-c(1,10)],
y=ISE_data.iii$ISE.predicted[(n-9):(n-1)], iterlim=300)$estimate
error5=ISE_data.iii$ISE.predicted[n]-beta%*%t(ISE_data.iii[n,2:9])
ISE.error5[result.index]=error5
}
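The `nlm` call above minimises the sum of absolute residuals, i.e. least-absolute-deviations regression with no intercept (matching the design of columns 2 to 9). The same idea can be sketched in isolation on synthetic data; names and values here are illustrative, and `nlm` on this non-smooth objective gives an approximate, not exact, minimiser:

```r
# Least-absolute-deviations regression via nlm on synthetic data (sketch).
set.seed(42)
x <- matrix(rnorm(200), ncol = 2)               # 100 obs, 2 predictors
y <- x %*% c(2, -1) + rnorm(100, sd = 0.05)     # true coefficients 2 and -1
sar <- function(beta) sum(abs(y - x %*% beta))  # sum of absolute residuals
fit <- nlm(sar, p = c(0, 0))$estimate           # approximately c(2, -1)
```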
#calculate RMSE
(RMSE5=sqrt(mean(ISE.error5^2)))
#calculate MAE
(MAE5=mean(abs(ISE.error5)))
#calculate standard error of RMSE
(SE.RMSE5=(sd(ISE.error5^2)/sqrt(526))/(2*sqrt(mean(ISE.error5^2))))
#calculate standard error of MAE
(SE.MAE5=sd(abs(ISE.error5))/sqrt(526))
#95% confidence interval for RMSE
RMSE5-1.96*SE.RMSE5; RMSE5+1.96*SE.RMSE5
#95% confidence interval for MAE
MAE5-1.96*SE.MAE5; MAE5+1.96*SE.MAE5
#create a vector of errors for relative RMSE and relative MAE for part (e)
ISE.rerror5=ISE.error5/ISE_data[c(11:536),2]
#calculate relative RMSE
(rRMSE5=sqrt(mean(ISE.rerror5^2)))
#calculate relative MAE
(rMAE5=mean(abs(ISE.rerror5)))
#calculate standard error of relative RMSE
(SE.rRMSE5=(sd(ISE.rerror5^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror5^2))))
#calculate standard error of relative MAE
(SE.rMAE5=sd(abs(ISE.rerror5))/sqrt(526))
#95% confidence interval for relative RMSE
rRMSE5-1.96*SE.rRMSE5; rRMSE5+1.96*SE.rRMSE5
#95% confidence interval for relative MAE
rMAE5-1.96*SE.rMAE5; rMAE5+1.96*SE.rMAE5
#Wilcoxon tests to compare the part (e) method with the 4 methods from part (d)
wilcox.test(abs(ISE.error5),abs(ISE.error1), paired=TRUE)
wilcox.test(abs(ISE.error5),abs(ISE.error2), paired=TRUE)
wilcox.test(abs(ISE.error5),abs(ISE.error3), paired=TRUE)
wilcox.test(abs(ISE.error5),abs(ISE.error4), paired=TRUE)
Appendix C: References
1. Jeff Cartier. The Basics of Creating Graphs with SAS/GRAPH® Software. [Online].
Available from: https://support.sas.com/rnd/datavisualization/papers/GraphBasics.pdf
[Accessed 24 February 2016].
2. Steven M. LaLonde. 2012. Transforming Variables for Normality and Linearity –
When, How, Why and Why Not's. [Online]. Available from:
http://support.sas.com/resources/papers/proceedings12/430-2012.pdf [Accessed 13 March 2016].
3. David L. Cassell. 2007. Don't Be Loopy: Re-Sampling and Simulation the SAS® Way.
[Online]. Available from: http://www2.sas.com/proceedings/forum2007/183-2007.pdf
[Accessed 14 March 2016].
4. Michael J. Wieczkowski. Alternatives to Merging SAS Data Sets … But Be Careful.
[Online]. Available from: http://www.ats.ucla.edu/stat/sas/library/nesug99/bt150.pdf
[Accessed 23 March 2016].