
Report (Istanbul Stock Exchange and Resistance)


We derived the most suitable model for predicting future stock prices by analysing the results of experiments based on statistics, such as benchmarking experiments.

Published in: Data & Analytics

1. STAT7001 Computing for Practical Statistics
   In-Course Assessment 2

   TASK 1: PREDICTION OF THE ISTANBUL STOCK MARKET ...... 2
   TASK 2: THE RESISTANCE OF CONSTANTIN ................. 7
   APPENDIX A: TASK 1 R CODE ........................... 11
   APPENDIX B: TASK 2 SAS CODE ......................... 26
   APPENDIX C: REFERENCES .............................. 33
2. Task 1: Prediction of the Istanbul Stock Market

Main Question
The main task was to use different prediction strategies to predict the daily returns of the Istanbul Stock Exchange (ISE) index, based on the ISE returns themselves as well as the returns of 7 other stock indices, and to compare the performance of these prediction methods by calculating error measures such as RMSE, MAE, and their relative variants. Throughout this report, a 5% significance level is applied to all analyses.

Summary
In the benchmarking experiments where ISE returns were predicted from same-day data, the models based on other stock indices were significantly better than using the mean ISE return as a predictor, while the inclusion of time did not significantly change the goodness of prediction. In the benchmarking experiments where predictions were made only from previous days' data, the reverse was observed: predictors based only on prior ISE returns performed significantly better than models based on previous stock index returns, suggesting that a non-linear relationship may exist between ISE returns and those of previous days.

Exploratory Data Analysis
Figures 1 to 8: Scatter plots of stock index returns (y-axis) against the number of days since the earliest record (x-axis), with correlation estimates and p-values:
Figure 1. ISE: cor=-0.0499, p-value=0.2485
Figure 2. S&P 500: cor=0.0245, p-value=0.5714
Figure 3. DAX: cor=0.0299, p-value=0.4891
Figure 4. FTSE 100: cor=0.0190, p-value=0.6615
Figure 5. Nikkei 225: cor=0.00533, p-value=0.9019
Figure 6. Ibovespa: cor=-0.0582, p-value=0.1786
Figure 7. MSCI EU: cor=0.0121, p-value=0.7803
Figure 8. MSCI EM: cor=-0.0538, p-value=0.2136

From the scatter plots in Figures 1 to 8, there is no apparent association between the returns of the stock indices and time, as the locations of the index returns do not appear to change with time. A correlation test was performed on each stock index's returns against time; the results indicate no apparent linear association between the variables at the 5% significance level, as all p-values were greater than 0.05.
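For illustration, the correlation tests above can be sketched outside R as well. The following is a minimal Python sketch (not the report's code; the report uses R's cor.test, see Appendix A) of a Pearson correlation test with a large-sample normal approximation to the p-value, on synthetic returns:

```python
import math
import random

def cor_test(x, y):
    """Pearson correlation with a large-sample p-value for H0: rho = 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))   # t-statistic for H0: rho = 0
    p = math.erfc(abs(t) / math.sqrt(2))       # two-sided, normal approximation
    return r, p

random.seed(0)
days = list(range(536))                             # days since earliest record
returns = [random.gauss(0.0, 0.015) for _ in days]  # synthetic daily returns
r, p = cor_test(days, returns)
print(f"cor = {r:.4f}, p-value = {p:.4f}")          # compare p with the 5% level
```

With 536 observations the normal approximation to the t-distribution is very close, which is why a plain `erfc` suffices here.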
3. Task 1: Prediction of the Istanbul Stock Market

Figures 9 to 16: Scatter plots of each stock index's returns for days N-1, N-2 and N-3 (y-axis, left to right) against its returns for day N (x-axis), with correlation estimates and p-values:

| Figure | Index | Lag 1 | Lag 2 | Lag 3 |
|---|---|---|---|---|
| 9 | ISE | cor=0.0188, p=0.6651 | cor=-0.0124, p=0.7745 | cor=-0.0337, p=0.4369 |
| 10 | S&P 500 | cor=-0.0608, p=0.1605 | cor=-0.0300, p=0.4888 | cor=-0.00845, p=0.8456 |
| 11 | DAX | cor=0.00132, p=0.9758 | cor=-0.0262, p=0.5453 | cor=-0.0172, p=0.6919 |
| 12 | FTSE 100 | cor=-0.00739, p=0.8646 | cor=-0.0276, p=0.5248 | cor=-0.0218, p=0.6149 |
| 13 | Nikkei 225 | cor=-0.0782, p=0.07085 | cor=0.0261, p=0.5479 | cor=0.000953, p=0.9825 |
| 14 | Ibovespa | cor=-0.0485, p=0.2626 | cor=-0.0140, p=0.7463 | cor=-0.0457, p=0.2921 |
| 15 | MSCI EU | cor=0.00995, p=0.8184 | cor=-0.0420, p=0.3323 | cor=0.00254, p=0.9533 |
| 16 | MSCI EM | cor=0.149, p=0.0005403 | cor=-0.0141, p=0.7449 | cor=0.0489, p=0.2599 |

The scatter plots in Figures 9 to 16 show that there are generally no patterns between a stock index's returns and its returns one, two, and three days before. Correlation tests confirm this, suggesting no correlation in all cases except the first panel of Figure 16, which shows a slight positive association (cor=0.149, p-value=0.0005403 < 0.05) between MSCI EM returns and its returns one day earlier. In light of the above, it may be reasonable to suggest that stock index returns are, in general, not linearly related to those of the recent past few days.
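The lag-k checks in Figures 9 to 16 pair each day-N return with the return k days earlier. A small Python sketch of that pairing and its sample correlation (illustrative only; the report does this in R by offsetting row indices, as in Appendix A):

```python
import random

def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def lag_correlation(series, k):
    # pair the return on day N with the return on day N-k
    return pearson_r(series[k:], series[:-k])

random.seed(1)
returns = [random.gauss(0.0, 0.015) for _ in range(536)]  # synthetic returns
for k in (1, 2, 3):
    print(k, round(lag_correlation(returns, k), 4))
```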
4. Task 1: Prediction of the Istanbul Stock Market

Results and Interpretation of Prediction from Same-Day Indices (Part C)

Table 1. Error measures and their 95% confidence intervals for different prediction methods, under the respective validation set-ups.

| Validation Set-up | Prediction Method | RMSE | MAE | Relative RMSE | Relative MAE |
|---|---|---|---|---|---|
| Chronological 80-20 split | Mean | 0.0131 (0.0112, 0.0149) | 0.0100 (0.0084, 0.0116) | 1.68 (1.13, 2.23) | 1.22 (1.00, 1.44) |
| | Linear Model w/o Time | 0.0108 (0.0094, 0.0122) | 0.00855 (0.00729, 0.00980) | 3.35 (1.63, 5.07) | 1.61 (1.05, 2.17) |
| | Linear Model w/ Time | 0.0107 (0.0093, 0.0121) | 0.00852 (0.00729, 0.00975) | 3.06 (1.63, 4.49) | 1.56 (1.06, 2.06) |
| | Robust Linear Regression | 0.0105 (0.0091, 0.0119) | 0.00845 (0.00726, 0.00964) | 3.04 (1.33, 4.74) | 1.53 (1.03, 2.03) |
| 5-fold cross-validation | Mean | 0.0162 (0.0134, 0.0191) | 0.0121 (0.0100, 0.0141) | 1.49 (1.03, 1.95) | 1.14 (0.96, 1.32) |
| | Linear Model w/o Time | 0.0120 (0.0100, 0.0140) | 0.00920 (0.00773, 0.01067) | 7.43 (1.30, 13.55) | 2.11 (0.76, 3.46) |
| | Linear Model w/ Time | 0.0120 (0.0100, 0.0140) | 0.00920 (0.00773, 0.01066) | 7.46 (1.30, 13.62) | 2.11 (0.75, 3.47) |
| | Robust Linear Regression | 0.0121 (0.0101, 0.0142) | 0.00928 (0.00780, 0.01076) | 6.85 (1.36, 12.34) | 2.02 (0.78, 3.26) |

Table 2. Paired Wilcoxon signed-rank tests on the absolute residuals of different prediction methods.

| Validation Set-up | Prediction Method | Linear Model w/o Time | Linear Model w/ Time | Robust Linear Regression |
|---|---|---|---|---|
| Chronological 80-20 split | Mean | V=3695, p=0.0123 | V=3688, p=0.01307 | V=3612, p=0.02473 |
| | Linear Model w/o Time | | V=3031, p=0.6601 | V=3135, p=0.4455 |
| | Linear Model w/ Time | | | V=3163, p=0.3953 |
| 5-fold cross-validation | Mean | V=96316, p=1.121e-11 | V=96500, p=7.849e-12 | V=96397, p=9.587e-12 |
| | Linear Model w/o Time | | V=72898, p=0.7934 | V=68503, p=0.3356 |
| | Linear Model w/ Time | | | V=68297, p=0.3075 |

The similar error measures for the linear models with and without time, shown in Table 1, indicate that the two models perform equally well under both the chronological and 5-fold cross-validation set-ups. This is confirmed by the paired Wilcoxon signed-rank test results in Table 2, where p-values of 0.6601 and 0.7934 (chronological and 5-fold respectively) indicate no significant difference in the absolute residuals of the two models. These results agree with the preliminary conclusions from the exploratory analysis: since no apparent association was found between stock index returns and time, the linear models with and without time should predict ISE returns equally well, as the time variable does not add significant information.

Additionally, a robust linear regression (RLR) model was created, predicting the ISE return from the same-day returns of the other stock indices, without time. This also produced error measures similar to those of the ordinary least squares regression models, and paired Wilcoxon signed-rank tests again indicated no significant difference in absolute residuals (p-values of 0.4455 and 0.3953 for the chronological set-up; 0.3356 and 0.3075 for 5-fold). However, these 3 models have RMSE and MAE values considerably smaller than those of the prediction using the mean ISE return of the training data set, as seen in Table 1.

5. Task 1: Prediction of the Istanbul Stock Market

This is confirmed by the paired Wilcoxon signed-rank tests. For the chronological set-up, the p-values of 0.0123, 0.01307, and 0.02473 (for the tests of mean vs. LM w/o time, LM w/ time, and RLR respectively) suggest some evidence of a difference in the absolute residuals of the models. For the 5-fold set-up, the corresponding p-values of 1.121e-11, 7.849e-12, and 9.587e-12 suggest strong evidence of a significant difference in the absolute residuals. At the 5% significance level, we therefore conclude that the mean ISE return is a worse prediction method than any of the other 3 models, under both validation set-ups.

Results and Interpretation of Prediction from Previous-Day Indices (Part D)

Table 3. Error measures and their 95% confidence intervals for different prediction methods.

| Validation Set-up | Prediction Method | RMSE | MAE | Relative RMSE | Relative MAE |
|---|---|---|---|---|---|
| 11 Consecutive Days | Most Recent ISE Return | 0.0221 (0.0203, 0.0240) | 0.0170 (0.0158, 0.0182) | 16.31 (6.83, 25.78) | 4.23 (2.88, 5.57) |
| | Mean ISE Return of Recent 5 Days | 0.0173 (0.0159, 0.0187) | 0.0130 (0.0120, 0.0140) | 4.50 (3.11, 5.89) | 1.97 (1.63, 2.32) |
| | LM - Most Recent Day | 0.724 (0.319, 1.129) | 0.169 (0.110, 0.230) | 221.3 (91.4, 351.2) | 43.1 (24.5, 61.7) |
| | LM - Most Recent 2 Days | 2.59 (0.31, 4.86) | 0.274 (0.054, 0.494) | 280.0 (134.7, 425.3) | 49.1 (25.5, 72.6) |
| | Robust Linear Regression | 0.0634 (0.0397, 0.0870) | 0.0344 (0.0298, 0.0389) | 25.89 (17.98, 7.98) | 8.79 (6.70, 10.87) |

Table 4. Paired Wilcoxon signed-rank tests on the absolute residuals of different prediction methods.

| Validation Set-up | Prediction Method | Mean ISE Return of Recent 5 Days | LM - Most Recent Day | LM - Most Recent 2 Days | Robust Linear Regression |
|---|---|---|---|---|---|
| 11 Consecutive Days | Most Recent ISE Return | V=95496, p=5.857e-14 | V=15103, p<2.2e-16 | V=12427, p<2.2e-16 | V=100020, p<2.2e-16 |
| | Mean ISE Return of Recent 5 Days | | V=10271, p<2.2e-16 | V=7468, p<2.2e-16 | V=111260, p<2.2e-16 |
| | LM - Most Recent Day | | | V=65944, p=0.3359 | V=25768, p<2.2e-16 |
| | LM - Most Recent 2 Days | | | | V=31006, p<2.2e-16 |

The error measures for the LM based on the most recent day's stock index returns are all smaller than those for the LM based on the most recent 2 days. However, the large standard errors of these error measures suggest that this difference might not be significant; this is confirmed by the paired Wilcoxon signed-rank test, with a p-value of 0.3359 indicating no significant difference in the absolute residuals of the two models.

A robust linear regression model was also created, predicting the ISE return from the stock index returns of the most recent day. These are the same covariates as in the LM for the most recent day, but the coefficients are obtained by a different, more robust method, and the robust regression shows lower values for all measures of prediction goodness. This is confirmed by the paired Wilcoxon signed-rank test, with p-value < 2.2e-16 indicating a significant difference in the absolute residuals of the two models.

However, the prediction method using the mean ISE return of the recent 5 days shows the lowest value for every error measure. Furthermore, the upper bounds of the 95% CIs of all 4 of its error measures are lower than the lower bounds of the 95% CIs of the error measures from all other models. The p-values from the paired Wilcoxon signed-rank tests of this method against all other methods (5.857e-14, < 2.2e-16, < 2.2e-16, < 2.2e-16) also support significant differences in the absolute residuals. Thus, there is strong evidence that using the mean ISE return of the recent 5 days is the best of the five prediction methods in this benchmarking experiment.
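The paired comparisons in Tables 2 and 4 use R's wilcox.test with paired=TRUE. As a rough illustrative sketch (ignoring tie corrections and using the large-sample normal approximation), a paired Wilcoxon signed-rank test on two sets of absolute residuals looks like this in Python; the residual values below are made up:

```python
import math

def wilcoxon_signed_rank(a, b):
    """Paired signed-rank test; returns (V, two-sided p), normal approximation."""
    d = [x - y for x, y in zip(a, b) if x != y]      # drop zero differences
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    for rank, i in enumerate(order, start=1):        # note: ties not averaged here
        ranks[i] = rank
    v = sum(r for r, x in zip(ranks, d) if x > 0)    # V: rank sum of positives
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (v - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))             # two-sided p-value
    return v, p

abs_res_a = [0.021, 0.017, 0.025, 0.030, 0.019, 0.028, 0.022, 0.026]
abs_res_b = [0.012, 0.014, 0.016, 0.011, 0.015, 0.013, 0.010, 0.017]
print(wilcoxon_signed_rank(abs_res_a, abs_res_b))
```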
6. Task 1: Prediction of the Istanbul Stock Market

It should be noted that the initial exploratory analysis concluded that there seems to be no linear association between ISE returns and the returns one, two, and three days before (Figure 9). However, the benchmarking experiment in Part D suggests that the prediction method based on the recent 5 days of ISE returns is the best prediction method, which contradicts the exploratory results. This may suggest that non-linear associations exist between ISE returns and those of previous days, allowing predictions to be made from previous ISE returns. Alternatively, this may be due to poorly designed prediction methods, with the mean ISE return of the recent 5 days merely performing relatively better than the rest.

Conclusion
To assess the prediction methods for the Istanbul stock market, two different benchmarking experiments were performed in this task: one predicting the ISE return from the other indices on the same day, and one predicting it from all indices, including the ISE itself, on recent previous days. The first benchmarking experiment showed that least-squares regression methods based on other same-day data generally performed better than using the mean ISE return as a predictor, in terms of error measures such as RMSE and MAE; this was confirmed by the paired Wilcoxon signed-rank tests on the absolute residuals of the different prediction methods. Additionally, including time as a covariate did not significantly change the models' goodness of prediction, concurring with the results of the exploratory analysis.

On the other hand, in the second benchmarking experiment there was sufficient evidence for the opposite conclusion: prediction models based only on prior ISE returns performed significantly better than models based on the previous returns of all stock indices. This contradicts the exploratory analysis, which found no significant linear association between ISE returns and its returns from days before, and thus suggests a possible non-linear relationship not identified by the correlation tests.

Comparing the benchmarking experiments in Parts C and D, the error measures calculated in Part C tend to be smaller than those in Part D. This might suggest that prediction models based on data from the same day are better at predicting ISE returns than models based only on recent previous data, indicating that same-day data provides better information about, or is more closely associated with, the ISE returns.
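The two validation set-ups used in Part C can be sketched as index partitions. A minimal Python illustration (the report's actual splits are built in the R code of Appendix A; the round-robin fold assignment here is just one simple choice):

```python
def chronological_split(n, frac=0.8):
    """Chronologically first `frac` of the data as training, rest as test."""
    cut = round(n * frac)
    return list(range(cut)), list(range(cut, n))

def k_folds(n, k=5):
    """k-fold CV partition: round-robin fold assignment of indices 0..n-1."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

train, test = chronological_split(536)   # the 536-day ISE data set
print(len(train), len(test))             # prints: 429 107
```

The 429/107 split matches the "chronologically first 80% (429 entries)" described in the appendix.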
7. Task 2: The Resistance of Constantin

Main Question
The data, from the 8th edition of the "Rubber Bible", contains 16 data points of the resistance of Constantin wire at different diameters. The main task was to fit different regression models to explain resistance in terms of diameter, and to investigate the goodness of fit of these models by obtaining estimates of error measures such as RMSE and MAE. A 5% significance level was applied to all analyses in this report.

Summary
Based on the investigation, regression models involving logarithmic or reciprocal transformations generally had a higher goodness of fit. The regression model on 1/d^2 was found to best explain the relationship between resistance and diameter, producing the simplest model with high goodness of fit and the smallest residuals. The model on 1/d^2 + 1/d performed similarly, but the covariate 1/d was found to be insignificant, while the log-transformed model produced residuals larger than those of the 1/d^2 model. Meanwhile, fitting resistance to a polynomial of degree 15 in diameter produced an over-fitted, rank-deficient model.

Exploratory Analysis
Figure 1 (a) and (b): Histograms of the variables R and d.

Both the resistance and diameter variables have positive values below 1. From the histograms, both can be seen to be positively skewed, with positive skewness estimates of 2.9870 and 1.3166 respectively. To shrink the larger values more than the smaller ones, a power or log transformation can be used to produce a more symmetric distribution. However, as all the values are between 0 and 1, a power transformation might not achieve the desired effect.

Figure 2: Scatter plot of the resistance of Constantin wire against its diameter.

From the scatter plot (Figure 2), we can see that as diameter increases, the resistance decreases; there is clearly a decreasing, non-linear relationship between the two variables. Suggested transformations of the data include logarithmic or reciprocal transformations, to straighten out the bivariate non-linear relationship.
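The skewness figures quoted above (2.9870 and 1.3166) come from the report's SAS output. For illustration, sample skewness and the effect of a log transformation can be sketched as follows in Python, on made-up right-skewed values between 0 and 1:

```python
import math

def skewness(x):
    """Sample skewness: third central moment over sd cubed (biased version)."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / s2 ** 1.5

# Made-up positive values below 1, right-skewed like R and d in the report.
vals = [0.01, 0.02, 0.02, 0.03, 0.05, 0.08, 0.15, 0.4, 0.9]
print(skewness(vals))                         # positive: right-skewed
print(skewness([math.log(v) for v in vals]))  # smaller: closer to symmetric
```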
8. Task 2: The Resistance of Constantin

Regression Models
Four regression models were fitted:
- Model 1: log(R) = log(d)
- Model 2: R = d + d^2 + d^3 + d^4 + d^5 + d^6 + d^7 + d^8 + d^9 + d^10 + d^11 + d^12 + d^13 + d^14 + d^15
- Model 3: R = 1/d^2 + 1/d
- Model 4: R = 1/d^2

For Model 1, the fit plot is negatively linear, as the logarithmic transformation of both variables gives negative values. Log(d) is a significant parameter, as its p-value is lower than 0.05, and the R-square of 1.00 indicates a good fit. Meanwhile, Model 2 is a rank-deficient least-squares model: its least-squares solutions for the parameters are not unique, producing biased estimates and some misleading statistics.

Table 2: Parameter estimates for Model 2 (R-Square = 0.9944).

| Variable | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|---|---|---|---|---|
| Intercept | 3.11362 | 0.21165 | 14.71 | <.0001 |
| d | -417.05833 | 43.70357 | -9.54 | <.0001 |
| d^2 | 23289 | 3277.49356 | 7.11 | 0.0004 |
| d^3 | -685537 | 119274 | -5.75 | 0.0012 |
| d^4 | 11541180 | 2347752 | 4.92 | 0.0027 |
| d^5 | -113141376 | 25879463 | -4.37 | 0.0047 |
| d^6 | 617364760 | 154399878 | 4.00 | 0.0071 |
| d^7 | -1539663710 | 412457801 | -3.73 | 0.0097 |
| d^8 | 0 | . | . | . |
| d^9 | 5151630410 | 1517635765 | 3.39 | 0.0146 |
| d^10 | 0 | . | . | . |
| d^11 | 0 | . | . | . |
| d^12 | 0 | . | . | . |
| d^13 | 0 | . | . | . |
| d^14 | -4.28733E11 | 1.406149E11 | -3.05 | 0.0225 |
| d^15 | 0 | . | . | . |

Table 2 shows that modelling resistance as a polynomial of degree 15 in diameter might be over-fitted. Its R-square value of 0.9944 is high only because R-squared increases every time a predictor is added to a model, supporting the argument that this model is not a good fit.

Model 3 is a rather good fit, with an R-square of 1 (Table 3). However, the p-value of 0.3798 for the parameter 1/d suggests that this parameter is not significant in the model. Based on this, Model 4 was fitted with 1/d^2 only, as a simpler model is generally preferred.
Table 1: Parameter estimates for Model 1 (R-Square = 1.0000).

| Variable | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|---|---|---|---|---|
| Intercept | -9.68163 | 0.00659 | -1469.6 | <.0001 |
| log(d) | -1.99987 | 0.00208 | -963.01 | <.0001 |

Table 3: Parameter estimates for Model 3 (R-Square = 1.0000).

| Variable | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|---|---|---|---|---|
| Intercept | 0.00024745 | 0.00061146 | 0.40 | 0.6923 |
| 1/d^2 | 0.00006279 | 2.538051E-7 | 247.39 | <.0001 |
| 1/d | -0.00002805 | 0.00003085 | -0.91 | 0.3798 |
9. Task 2: The Resistance of Constantin

Table 4: Parameter estimates for Model 4 (R-Square = 1.0000).

| Variable | Parameter Estimate | Standard Error | t Value | Pr > |t| |
|---|---|---|---|---|
| Intercept | -0.00020809 | 0.00034825 | -0.60 | 0.5597 |
| 1/d^2 | 0.00006257 | 7.917185E-8 | 790.31 | <.0001 |

In Model 4, the transformed variable 1/d^2 is a significant parameter, with a p-value of less than 0.0001. The fit plot is positively linear, with an R-square of 1. Based on the analysis so far, Models 1 and 4 give the best fit of the 4 models, agreeing with the potential models suggested by the exploratory findings. Further analysis is needed to establish a conclusion.

Cross Validation
Leave-one-out cross-validation was carried out to test the goodness of fit of the models; they are compared below.

Figure 3: Residual plots for the 4 fitted models.

First, we consider the residual plots of all 4 models. As shown, Models 1, 3 and 4 are relatively better fitted than Model 2. Model 2 has one extreme residual value of over 20000, which causes the large scale of its residual plot, and all of its residual values are relatively large compared to the other 3 models. Meanwhile, the residuals of Model 1 are randomly scattered but larger than those of Models 3 and 4.
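As a rough sketch of Model 4 and the leave-one-out procedure (illustrative only; the report's analysis is in SAS, Appendix B): fit R = b0 + b1*(1/d^2) by ordinary least squares on the transformed covariate, then refit with each point held out in turn. The diameters and the slope 6.26e-5 below are made up, chosen only to mimic the order of magnitude in Table 4.

```python
def ols(x, y):
    """Least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

def loo_abs_residuals(x, y):
    """Leave-one-out CV: refit without point i, predict it, keep |residual|."""
    out = []
    for i in range(len(x)):
        b0, b1 = ols(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append(abs(y[i] - (b0 + b1 * x[i])))
    return out

k = 6.26e-5                                # made-up slope, near Table 4's value
d = [0.05 + 0.01 * i for i in range(16)]   # 16 illustrative diameters
x = [1.0 / di ** 2 for di in d]            # transformed covariate 1/d^2
R = [k * xi for xi in x]                   # exact 1/d^2 law, no noise
b0, b1 = ols(x, R)
print(b0, b1)                              # b0 near 0, b1 near 6.26e-5
print(max(loo_abs_residuals(x, R)))        # near 0 for noise-free data
```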
10. Task 2: The Resistance of Constantin

The residual plots for Models 3 and 4 have the smallest scales among all the plots. This suggests that Models 3 and 4 have the smallest residuals, and thus could be better regression models. Then, we look at the estimates obtained for out-of-sample RMSE and MAE.

Table 5: Error measures and their 95% confidence intervals for the different regression models.

| Model | RMSE | MAE |
|---|---|---|
| Suggested (i): log(R)=log(d) | 0.0093 (0.0092, 0.0094) | 0.0065 (0.0049, 0.0081) |
| Part (b) (i): R=d+d^2+...+d^15 | 5358.40 (2764.46, 7952.34) | 1340.41 (0.91, 2679.91) |
| Part (b) (ii): R=1/d^2+1/d | 0.0019 (0.0014, 0.0024) | 0.0009 (0.0005, 0.0013) |
| Suggested (ii): R=1/d^2 | 0.0013 (0.0009, 0.0017) | 0.0006 (0.0003, 0.0009) |

As expressed in Table 5, the second suggested model (R=1/d^2), which transforms diameter to the power of -2, shows the lowest value for both measures of prediction goodness. However, there is an overlap in the 95% confidence intervals of the error measures for R=1/d^2 and R=1/d^2+1/d, so we cannot yet conclude that Model 4 is the best model. Hence, paired Wilcoxon signed-rank tests were carried out between all 4 models to confirm the results above.

Table 6: Results of paired Wilcoxon signed-rank tests on absolute residuals.

| Models | R=d+d^2+...+d^15 | R=1/d^2+1/d | R=1/d^2 |
|---|---|---|---|
| log(R)=log(d) | S=65, p=0.0002 | S=64, p=0.0002 | S=68, p<.0001 |
| R=d+d^2+...+d^15 | | S=67, p<.0001 | S=67, p<.0001 |
| R=1/d^2+1/d | | | S=5, p=0.8209 |

The tests support the deduction that the models' performances are significantly different from one another, as most p-values are less than 0.05, indicating significant differences in the absolute residuals from each regression model. The exception is the pair of the 4th (R=1/d^2) and 3rd (R=1/d^2+1/d) models, with p-value=0.8209 suggesting no significant difference in performance. Generally, when two models perform similarly, the one with fewer covariates is preferred. Thus, in this case, the 4th model is the best model for explaining resistance in terms of diameter.

Conclusion
Based on the investigation, Model 2 (R=d+d^2+...+d^15) is an extreme example of fitting an overly complicated model to obtain a good fit: the model is too complex for the data, even though it appears to explain a lot of the variation in the response variable. Model 1 (log(R)=log(d)) is relatively good, but it does not have the lowest RMSE and MAE, suggesting that its residuals are relatively large. Meanwhile, Model 3 (R=1/d^2+1/d) has one insignificant covariate, which leads to the second suggested model. In conclusion, Model 4 (R=1/d^2) is the best model for explaining the resistance of Constantin wire in terms of varying diameter, producing the simplest model with high goodness of fit and the smallest residuals, as evidenced by its high R-squared value and low RMSE and MAE error measures. The model can be interpreted as follows: as the diameter of the Constantin wire decreases, 1/d^2 increases, and so the resistance increases.
11. Appendix A: Task 1 R Code

# Part (a). Data load and conversion of the date column.

# Load data from CSV file.
ISE_data=read.csv(file="C:/Documents/STAT7001/Istanbul.csv", header=TRUE, sep=",")

# Convert the date column into a recognisable date format in R.
ISE_data$date=as.POSIXct(ISE_data$date, format="%d-%b-%Y")

# Find the difference in numbers of days, and round off any decimals.
ISE_data$date<-difftime(ISE_data$date, ISE_data$date[1], units="days")
ISE_data$date<-round(ISE_data$date, digits=0)
ISE_data$date=as.numeric(as.character(ISE_data$date))
12. Task 1 Code

# Part (b). Exploratory data analysis.

# Association between index and time.
plot(ISE_data[,1], ISE_data[,2], xlab="Days", ylab="ISE", abline(lm(ISE~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,3], xlab="Days", ylab="S&P 500", abline(lm(S.P.500~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,4], xlab="Days", ylab="DAX", abline(lm(DAX~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,5], xlab="Days", ylab="FTSE 100", abline(lm(FTSE~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,6], xlab="Days", ylab="Nikkei 225", abline(lm(NIKKEI~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,7], xlab="Days", ylab="Ibovespa", abline(lm(BOVESPA~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,8], xlab="Days", ylab="MSCI EU Index", abline(lm(MSCI.EU~date, ISE_data)))
plot(ISE_data[,1], ISE_data[,9], xlab="Days", ylab="MSCI EM Index", abline(lm(MSCI.EM~date, ISE_data)))

cor.test(ISE_data[,1], ISE_data[,2])
cor.test(ISE_data[,1], ISE_data[,3])
cor.test(ISE_data[,1], ISE_data[,4])
cor.test(ISE_data[,1], ISE_data[,5])
cor.test(ISE_data[,1], ISE_data[,6])
cor.test(ISE_data[,1], ISE_data[,7])
cor.test(ISE_data[,1], ISE_data[,8])
cor.test(ISE_data[,1], ISE_data[,9])

# Association between ISE index and index the days before.
plot(ISE_data[c(2:536),2], ISE_data[c(1:535),2], xlab="ISE, Day N", ylab="ISE, Day N-1")
plot(ISE_data[c(3:536),2], ISE_data[c(1:534),2], xlab="ISE, Day N", ylab="ISE, Day N-2")
plot(ISE_data[c(4:536),2], ISE_data[c(1:533),2], xlab="ISE, Day N", ylab="ISE, Day N-3")
cor.test(ISE_data[c(2:536),2], ISE_data[c(1:535),2])
cor.test(ISE_data[c(3:536),2], ISE_data[c(1:534),2])
cor.test(ISE_data[c(4:536),2], ISE_data[c(1:533),2])

# Association between S&P 500 index and index the days before.
plot(ISE_data[c(2:536),3], ISE_data[c(1:535),3], xlab="S&P 500, Day N", ylab="S&P 500, Day N-1")
plot(ISE_data[c(3:536),3], ISE_data[c(1:534),3], xlab="S&P 500, Day N", ylab="S&P 500, Day N-2")
plot(ISE_data[c(4:536),3], ISE_data[c(1:533),3], xlab="S&P 500, Day N", ylab="S&P 500, Day N-3")
cor.test(ISE_data[c(2:536),3], ISE_data[c(1:535),3])
cor.test(ISE_data[c(3:536),3], ISE_data[c(1:534),3])
cor.test(ISE_data[c(4:536),3], ISE_data[c(1:533),3])

# Association between DAX index and index the days before.
plot(ISE_data[c(2:536),4], ISE_data[c(1:535),4], xlab="DAX, Day N", ylab="DAX, Day N-1")
plot(ISE_data[c(3:536),4], ISE_data[c(1:534),4], xlab="DAX, Day N", ylab="DAX, Day N-2")
plot(ISE_data[c(4:536),4], ISE_data[c(1:533),4], xlab="DAX, Day N", ylab="DAX, Day N-3")
cor.test(ISE_data[c(2:536),4], ISE_data[c(1:535),4])
cor.test(ISE_data[c(3:536),4], ISE_data[c(1:534),4])
cor.test(ISE_data[c(4:536),4], ISE_data[c(1:533),4])

# Association between FTSE 100 index and index the days before.
plot(ISE_data[c(2:536),5], ISE_data[c(1:535),5], xlab="FTSE 100, Day N", ylab="FTSE 100, Day N-1")
plot(ISE_data[c(3:536),5], ISE_data[c(1:534),5], xlab="FTSE 100, Day N", ylab="FTSE 100, Day N-2")
plot(ISE_data[c(4:536),5], ISE_data[c(1:533),5], xlab="FTSE 100, Day N", ylab="FTSE 100, Day N-3")
cor.test(ISE_data[c(2:536),5], ISE_data[c(1:535),5])
cor.test(ISE_data[c(3:536),5], ISE_data[c(1:534),5])
cor.test(ISE_data[c(4:536),5], ISE_data[c(1:533),5])
13. Task 1 Code

# Association between Nikkei 225 index and index the days before.
plot(ISE_data[c(2:536),6], ISE_data[c(1:535),6], xlab="Nikkei 225, Day N", ylab="Nikkei 225, Day N-1")
plot(ISE_data[c(3:536),6], ISE_data[c(1:534),6], xlab="Nikkei 225, Day N", ylab="Nikkei 225, Day N-2")
plot(ISE_data[c(4:536),6], ISE_data[c(1:533),6], xlab="Nikkei 225, Day N", ylab="Nikkei 225, Day N-3")
cor.test(ISE_data[c(2:536),6], ISE_data[c(1:535),6])
cor.test(ISE_data[c(3:536),6], ISE_data[c(1:534),6])
cor.test(ISE_data[c(4:536),6], ISE_data[c(1:533),6])

# Association between Ibovespa index and index the days before.
plot(ISE_data[c(2:536),7], ISE_data[c(1:535),7], xlab="Ibovespa, Day N", ylab="Ibovespa, Day N-1")
plot(ISE_data[c(3:536),7], ISE_data[c(1:534),7], xlab="Ibovespa, Day N", ylab="Ibovespa, Day N-2")
plot(ISE_data[c(4:536),7], ISE_data[c(1:533),7], xlab="Ibovespa, Day N", ylab="Ibovespa, Day N-3")
cor.test(ISE_data[c(2:536),7], ISE_data[c(1:535),7])
cor.test(ISE_data[c(3:536),7], ISE_data[c(1:534),7])
cor.test(ISE_data[c(4:536),7], ISE_data[c(1:533),7])

# Association between MSCI EU index and index the days before.
plot(ISE_data[c(2:536),8], ISE_data[c(1:535),8], xlab="MSCI EU, Day N", ylab="MSCI EU, Day N-1")
plot(ISE_data[c(3:536),8], ISE_data[c(1:534),8], xlab="MSCI EU, Day N", ylab="MSCI EU, Day N-2")
plot(ISE_data[c(4:536),8], ISE_data[c(1:533),8], xlab="MSCI EU, Day N", ylab="MSCI EU, Day N-3")
cor.test(ISE_data[c(2:536),8], ISE_data[c(1:535),8])
cor.test(ISE_data[c(3:536),8], ISE_data[c(1:534),8])
cor.test(ISE_data[c(4:536),8], ISE_data[c(1:533),8])

# Association between MSCI EM index and index the days before.
plot(ISE_data[c(2:536),9], ISE_data[c(1:535),9], xlab="MSCI EM, Day N", ylab="MSCI EM, Day N-1")
plot(ISE_data[c(3:536),9], ISE_data[c(1:534),9], xlab="MSCI EM, Day N", ylab="MSCI EM, Day N-2")
plot(ISE_data[c(4:536),9], ISE_data[c(1:533),9], xlab="MSCI EM, Day N", ylab="MSCI EM, Day N-3")
cor.test(ISE_data[c(2:536),9], ISE_data[c(1:535),9])
cor.test(ISE_data[c(3:536),9], ISE_data[c(1:534),9])
cor.test(ISE_data[c(4:536),9], ISE_data[c(1:533),9])
14. Task 1 Code

# Part (c). Benchmarking with all data.

# ----------------------------------------------------------------------------
# Creating functions for measures of prediction goodness and their std errors.
# ----------------------------------------------------------------------------

# (i) Root mean squared error (RMSE)
rmse=function(observed, fitted){
  sqrt(mean((observed-fitted)^2))
}
rmseSE=function(observed, fitted){
  sd((observed-fitted)^2)/sqrt(length(observed))/(2*sqrt(mean((observed-fitted)^2)))
}

# (ii) Mean absolute error (MAE)
mae=function(observed, fitted){
  mean(abs(observed-fitted))
}
maeSE=function(observed, fitted){
  sd(abs(observed-fitted))/sqrt(length(observed))
}

# (iii) Relative RMSE
RELrmse=function(observed, fitted){
  sqrt(mean(((observed-fitted)/observed)^2))
}
RELrmseSE=function(observed, fitted){
  sd(((observed-fitted)/observed)^2)/sqrt(length(observed))/
    (2*sqrt(mean(((observed-fitted)/observed)^2)))
}

# (iv) Relative MAE
RELmae=function(observed, fitted){
  mean(abs((observed-fitted)/observed))
}
RELmaeSE=function(observed, fitted){
  sd(abs((observed-fitted)/observed))/sqrt(length(observed))
}

# ---------------------------------------------------------------------------------
# Comparison of prediction methods, using validation set-up (i).
# i.e. Chronologically first 80% of data (428.8 or 429 entries) as training sample;
# remaining data as test sample.
# ---------------------------------------------------------------------------------

# Prediction method (i): Mean
# -- Predictor
Chr.ISEmean=mean(ISE_data$ISE[c(1:429)])

# -- Predicted values
Chr.ISEmean

# -- Error measures
Chr.mean.rmse = rmse(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.rmseSE = rmseSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.rmse-1.96*Chr.mean.rmseSE; Chr.mean.rmse+1.96*Chr.mean.rmseSE

Chr.mean.mae = mae(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.maeSE = maeSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.mae-1.96*Chr.mean.maeSE; Chr.mean.mae+1.96*Chr.mean.maeSE

Chr.mean.RELrmse = RELrmse(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELrmseSE = RELrmseSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELrmse-1.96*Chr.mean.RELrmseSE; Chr.mean.RELrmse+1.96*Chr.mean.RELrmseSE

Chr.mean.RELmae = RELmae(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELmaeSE = RELmaeSE(ISE_data$ISE[c(430:536)], Chr.ISEmean)
Chr.mean.RELmae-1.96*Chr.mean.RELmaeSE; Chr.mean.RELmae+1.96*Chr.mean.RELmaeSE
# Prediction method (ii): Linear model excluding time.
# -- Model
Chr.LMnoTime=lm(ISE ~ S.P.500 + DAX + FTSE + NIKKEI + BOVESPA + MSCI.EU + MSCI.EM,
                data=ISE_data[c(1:429),])
summary(Chr.LMnoTime)
# -- Predicted values
Chr.LMnoTime.Pred=predict(Chr.LMnoTime, ISE_data[c(430:536),])
# -- Error measures
Chr.LMnoTime.rmse = rmse(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.rmseSE = rmseSE(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.rmse-1.96*Chr.LMnoTime.rmseSE; Chr.LMnoTime.rmse+1.96*Chr.LMnoTime.rmseSE
Chr.LMnoTime.mae = mae(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.maeSE = maeSE(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.mae-1.96*Chr.LMnoTime.maeSE; Chr.LMnoTime.mae+1.96*Chr.LMnoTime.maeSE
Chr.LMnoTime.RELrmse = RELrmse(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.RELrmseSE = RELrmseSE(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.RELrmse-1.96*Chr.LMnoTime.RELrmseSE; Chr.LMnoTime.RELrmse+1.96*Chr.LMnoTime.RELrmseSE
Chr.LMnoTime.RELmae = RELmae(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.RELmaeSE = RELmaeSE(ISE_data$ISE[c(430:536)], Chr.LMnoTime.Pred)
Chr.LMnoTime.RELmae-1.96*Chr.LMnoTime.RELmaeSE; Chr.LMnoTime.RELmae+1.96*Chr.LMnoTime.RELmaeSE

# Prediction method (iii): Linear model including time.
# -- Model
Chr.LMwithTime=lm(ISE ~ date+S.P.500+DAX+FTSE+NIKKEI+BOVESPA+MSCI.EU+MSCI.EM,
                  data=ISE_data[c(1:429),])
summary(Chr.LMwithTime)
# -- Predicted values
Chr.LMwithTime.Pred=predict(Chr.LMwithTime, ISE_data[c(430:536),])
# -- Error measures
Chr.LMwithTime.rmse = rmse(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.rmseSE = rmseSE(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.rmse-1.96*Chr.LMwithTime.rmseSE; Chr.LMwithTime.rmse+1.96*Chr.LMwithTime.rmseSE
Chr.LMwithTime.mae = mae(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.maeSE = maeSE(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.mae-1.96*Chr.LMwithTime.maeSE; Chr.LMwithTime.mae+1.96*Chr.LMwithTime.maeSE
Chr.LMwithTime.RELrmse = RELrmse(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.RELrmseSE = RELrmseSE(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.RELrmse-1.96*Chr.LMwithTime.RELrmseSE; Chr.LMwithTime.RELrmse+1.96*Chr.LMwithTime.RELrmseSE
Chr.LMwithTime.RELmae = RELmae(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.RELmaeSE = RELmaeSE(ISE_data$ISE[c(430:536)], Chr.LMwithTime.Pred)
Chr.LMwithTime.RELmae-1.96*Chr.LMwithTime.RELmaeSE; Chr.LMwithTime.RELmae+1.96*Chr.LMwithTime.RELmaeSE

# Comparison of prediction methods.
wilcox.test(abs(ISE_data$ISE[c(430:536)]-Chr.ISEmean),
            abs(ISE_data$ISE[c(430:536)]-Chr.LMnoTime.Pred), paired=TRUE)
wilcox.test(abs(ISE_data$ISE[c(430:536)]-Chr.ISEmean),
            abs(ISE_data$ISE[c(430:536)]-Chr.LMwithTime.Pred), paired=TRUE)
wilcox.test(abs(ISE_data$ISE[c(430:536)]-Chr.LMnoTime.Pred),
            abs(ISE_data$ISE[c(430:536)]-Chr.LMwithTime.Pred), paired=TRUE)
# ---------------------------------------------------------------
# Comparison of prediction methods, using validation set-up (ii).
# i.e. Five-fold cross-validation with uniformly randomly sampled folds.
# ---------------------------------------------------------------
# Five-fold cross-validation data setup.
# Create random permutation of values.
set.seed(555)
randperm=sample(nrow(ISE_data))
# Create lists with test folds and their respective training folds.
trainfolds=list()
testfolds=list()
for(i in 1:5){
  lower=floor((i-1)*nrow(ISE_data)/5)+1
  upper=floor(i*nrow(ISE_data)/5)
  testfolds[[i]]=randperm[lower:upper]
  trainfolds[[i]]=setdiff(1:nrow(ISE_data), testfolds[[i]])
  testfolds[[i]]=ISE_data[testfolds[[i]],]
  trainfolds[[i]]=ISE_data[trainfolds[[i]],]
}
# ---------------------------------------------------------------
# Prediction method (i): Mean
# -- Predictor
Fol.ISEmean=list()
for(i in 1:5){ Fol.ISEmean[[i]]=mean(trainfolds[[i]][[2]]) }
# -- Predicted values
Fol.ISEmean
# -- Error measures
# *** RMSE ***
Fol.mean.rmse=list()
for(i in 1:5){ Fol.mean.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.rmse=mean(as.numeric(Fol.mean.rmse))
# Standard Error
Fol.mean.rmseSE=list()
for(i in 1:5){ Fol.mean.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.rmseSE=mean(as.numeric(Fol.mean.rmseSE))
# Confidence Interval
Fol.mean.rmse-1.96*Fol.mean.rmseSE; Fol.mean.rmse+1.96*Fol.mean.rmseSE
# *** MAE ***
Fol.mean.mae=list()
for(i in 1:5){ Fol.mean.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.mae=mean(as.numeric(Fol.mean.mae))
# Standard Error
Fol.mean.maeSE=list()
for(i in 1:5){ Fol.mean.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.maeSE=mean(as.numeric(Fol.mean.maeSE))
# Confidence Interval
Fol.mean.mae-1.96*Fol.mean.maeSE; Fol.mean.mae+1.96*Fol.mean.maeSE
# *** Relative RMSE ***
Fol.mean.RELrmse=list()
for(i in 1:5){ Fol.mean.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.RELrmse=mean(as.numeric(Fol.mean.RELrmse))
# Standard Error
Fol.mean.RELrmseSE=list()
for(i in 1:5){ Fol.mean.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.RELrmseSE=mean(as.numeric(Fol.mean.RELrmseSE))
# Confidence Interval
Fol.mean.RELrmse-1.96*Fol.mean.RELrmseSE; Fol.mean.RELrmse+1.96*Fol.mean.RELrmseSE
# *** Relative MAE ***
Fol.mean.RELmae=list()
for(i in 1:5){ Fol.mean.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.RELmae=mean(as.numeric(Fol.mean.RELmae))
# Standard Error
Fol.mean.RELmaeSE=list()
for(i in 1:5){ Fol.mean.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.ISEmean[[i]]) }
Fol.mean.RELmaeSE=mean(as.numeric(Fol.mean.RELmaeSE))
# Confidence Interval
Fol.mean.RELmae-1.96*Fol.mean.RELmaeSE; Fol.mean.RELmae+1.96*Fol.mean.RELmaeSE
# ---------------------------------------------------------------
# Prediction method (ii): Linear model excluding time.
# -- Models
Fol.LMnoTime=list()
for(i in 1:5){
  Fol.LMnoTime[[i]]=lm(ISE ~ S.P.500 + DAX + FTSE + NIKKEI + BOVESPA + MSCI.EU + MSCI.EM,
                       data=trainfolds[[i]])
}
# -- Predicted values
Fol.LMnoTime.Pred=list()
for(i in 1:5){ Fol.LMnoTime.Pred[[i]]=predict(Fol.LMnoTime[[i]], testfolds[[i]]) }
# -- Error measures
# *** RMSE ***
Fol.LMnoTime.rmse=list()
for(i in 1:5){ Fol.LMnoTime.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.rmse=mean(as.numeric(Fol.LMnoTime.rmse))
# Standard Error
Fol.LMnoTime.rmseSE=list()
for(i in 1:5){ Fol.LMnoTime.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.rmseSE=mean(as.numeric(Fol.LMnoTime.rmseSE))
# Confidence Interval
Fol.LMnoTime.rmse-1.96*Fol.LMnoTime.rmseSE; Fol.LMnoTime.rmse+1.96*Fol.LMnoTime.rmseSE
# *** MAE ***
Fol.LMnoTime.mae=list()
for(i in 1:5){ Fol.LMnoTime.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.mae=mean(as.numeric(Fol.LMnoTime.mae))
# Standard Error
Fol.LMnoTime.maeSE=list()
for(i in 1:5){ Fol.LMnoTime.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.maeSE=mean(as.numeric(Fol.LMnoTime.maeSE))
# Confidence Interval
Fol.LMnoTime.mae-1.96*Fol.LMnoTime.maeSE; Fol.LMnoTime.mae+1.96*Fol.LMnoTime.maeSE
# *** Relative RMSE ***
Fol.LMnoTime.RELrmse=list()
for(i in 1:5){ Fol.LMnoTime.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.RELrmse=mean(as.numeric(Fol.LMnoTime.RELrmse))
# Standard Error
Fol.LMnoTime.RELrmseSE=list()
for(i in 1:5){ Fol.LMnoTime.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.RELrmseSE=mean(as.numeric(Fol.LMnoTime.RELrmseSE))
# Confidence Interval
Fol.LMnoTime.RELrmse-1.96*Fol.LMnoTime.RELrmseSE; Fol.LMnoTime.RELrmse+1.96*Fol.LMnoTime.RELrmseSE
# *** Relative MAE ***
Fol.LMnoTime.RELmae=list()
for(i in 1:5){ Fol.LMnoTime.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.RELmae=mean(as.numeric(Fol.LMnoTime.RELmae))
# Standard Error
Fol.LMnoTime.RELmaeSE=list()
for(i in 1:5){ Fol.LMnoTime.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.LMnoTime.Pred[[i]]) }
Fol.LMnoTime.RELmaeSE=mean(as.numeric(Fol.LMnoTime.RELmaeSE))
# Confidence Interval
Fol.LMnoTime.RELmae-1.96*Fol.LMnoTime.RELmaeSE; Fol.LMnoTime.RELmae+1.96*Fol.LMnoTime.RELmaeSE
# ---------------------------------------------------------------
# Prediction method (iii): Linear model including time.
# -- Models
Fol.LMwithTime=list()
for(i in 1:5){
  Fol.LMwithTime[[i]]=lm(ISE ~ date+S.P.500+DAX+FTSE+NIKKEI+BOVESPA+MSCI.EU+MSCI.EM,
                         data=trainfolds[[i]])
}
# -- Predicted values
Fol.LMwithTime.Pred=list()
for(i in 1:5){ Fol.LMwithTime.Pred[[i]]=predict(Fol.LMwithTime[[i]], testfolds[[i]]) }
# -- Error measures
# *** RMSE ***
Fol.LMwithTime.rmse=list()
for(i in 1:5){ Fol.LMwithTime.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.rmse=mean(as.numeric(Fol.LMwithTime.rmse))
# Standard Error
Fol.LMwithTime.rmseSE=list()
for(i in 1:5){ Fol.LMwithTime.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.rmseSE=mean(as.numeric(Fol.LMwithTime.rmseSE))
# Confidence Interval
Fol.LMwithTime.rmse-1.96*Fol.LMwithTime.rmseSE; Fol.LMwithTime.rmse+1.96*Fol.LMwithTime.rmseSE
# *** MAE ***
Fol.LMwithTime.mae=list()
for(i in 1:5){ Fol.LMwithTime.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.mae=mean(as.numeric(Fol.LMwithTime.mae))
# Standard Error
Fol.LMwithTime.maeSE=list()
for(i in 1:5){ Fol.LMwithTime.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.maeSE=mean(as.numeric(Fol.LMwithTime.maeSE))
# Confidence Interval
Fol.LMwithTime.mae-1.96*Fol.LMwithTime.maeSE; Fol.LMwithTime.mae+1.96*Fol.LMwithTime.maeSE
# *** Relative RMSE ***
Fol.LMwithTime.RELrmse=list()
for(i in 1:5){ Fol.LMwithTime.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.RELrmse=mean(as.numeric(Fol.LMwithTime.RELrmse))
# Standard Error
Fol.LMwithTime.RELrmseSE=list()
for(i in 1:5){ Fol.LMwithTime.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.RELrmseSE=mean(as.numeric(Fol.LMwithTime.RELrmseSE))
# Confidence Interval
Fol.LMwithTime.RELrmse-1.96*Fol.LMwithTime.RELrmseSE; Fol.LMwithTime.RELrmse+1.96*Fol.LMwithTime.RELrmseSE
# *** Relative MAE ***
Fol.LMwithTime.RELmae=list()
for(i in 1:5){ Fol.LMwithTime.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.RELmae=mean(as.numeric(Fol.LMwithTime.RELmae))
# Standard Error
Fol.LMwithTime.RELmaeSE=list()
for(i in 1:5){ Fol.LMwithTime.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.LMwithTime.Pred[[i]]) }
Fol.LMwithTime.RELmaeSE=mean(as.numeric(Fol.LMwithTime.RELmaeSE))
# Confidence Interval
Fol.LMwithTime.RELmae-1.96*Fol.LMwithTime.RELmaeSE; Fol.LMwithTime.RELmae+1.96*Fol.LMwithTime.RELmaeSE
# ---------------------------------------------------------------
# Comparison of prediction methods.
# Vector of residuals for prediction method (i).
Fol.ISEmean.resid=list()
for(i in 1:5){ Fol.ISEmean.resid[[i]]=testfolds[[i]]$ISE-Fol.ISEmean[[i]] }
Fol.ISEmean.resid=unlist(Fol.ISEmean.resid)
# Vector of residuals for prediction method (ii).
Fol.LMnoTime.resid=list()
for(i in 1:5){ Fol.LMnoTime.resid[[i]]=testfolds[[i]]$ISE-Fol.LMnoTime.Pred[[i]] }
Fol.LMnoTime.resid=unlist(Fol.LMnoTime.resid)
# Vector of residuals for prediction method (iii).
Fol.LMwithTime.resid=list()
for(i in 1:5){ Fol.LMwithTime.resid[[i]]=testfolds[[i]]$ISE-Fol.LMwithTime.Pred[[i]] }
Fol.LMwithTime.resid=unlist(Fol.LMwithTime.resid)
# Test for comparison of prediction methods.
wilcox.test(abs(Fol.ISEmean.resid), abs(Fol.LMnoTime.resid), paired=TRUE)
wilcox.test(abs(Fol.ISEmean.resid), abs(Fol.LMwithTime.resid), paired=TRUE)
wilcox.test(abs(Fol.LMnoTime.resid), abs(Fol.LMwithTime.resid), paired=TRUE)
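As a quick sanity check of the error-measure functions defined at the start of this appendix, the following toy example (illustrative values only, not ISE data) can be run after the function definitions; for a constant predictor, as in method (i), the measures reduce to simple summaries of the residuals.

```r
# Illustrative check of rmse() and mae() on toy data (assumed values).
obs <- c(1, 2, 3, 4)
fit <- rep(2.5, length(obs))   # constant prediction, as in method (i)
rmse(obs, fit)                 # sqrt(mean(c(-1.5,-0.5,0.5,1.5)^2)) = sqrt(1.25)
mae(obs, fit)                  # mean(c(1.5,0.5,0.5,1.5)) = 1
```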
# Part (d). Benchmarking with previous data.
# Create a vector of errors for RMSE and MAE in (i).
ISE.error1=vector(mode="numeric", length=526)
result.index=0
for(n in 11:536){
  result.index=result.index+1
  error1=ISE_data[n,2]-ISE_data[n-1,2]
  ISE.error1[result.index]=error1
}
# Calculate RMSE for (i).
(RMSE1=sqrt(mean(ISE.error1^2)))
# Calculate MAE for (i).
(MAE1=mean(abs(ISE.error1)))
# Calculate standard error of RMSE for (i).
(SE.RMSE1=(sd(ISE.error1^2)/sqrt(526))/(2*sqrt(mean(ISE.error1^2))))
# Calculate standard error of MAE for (i).
(SE.MAE1=sd(abs(ISE.error1))/sqrt(526))
# 95% confidence interval for RMSE.
RMSE1-1.96*SE.RMSE1; RMSE1+1.96*SE.RMSE1
# 95% confidence interval for MAE.
MAE1-1.96*SE.MAE1; MAE1+1.96*SE.MAE1
# Create a vector of errors for relative RMSE and relative MAE in (i).
ISE.rerror1=ISE.error1/ISE_data[c(11:536),2]
# Calculate relative RMSE for (i).
(rRMSE1=sqrt(mean(ISE.rerror1^2)))
# Calculate relative MAE for (i).
(rMAE1=mean(abs(ISE.rerror1)))
# Calculate standard error of relative RMSE for (i).
(SE.rRMSE1=(sd(ISE.rerror1^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror1^2))))
# Calculate standard error of relative MAE for (i).
(SE.rMAE1=sd(abs(ISE.rerror1))/sqrt(526))
# 95% confidence interval for relative RMSE.
rRMSE1-1.96*SE.rRMSE1; rRMSE1+1.96*SE.rRMSE1
# 95% confidence interval for relative MAE.
rMAE1-1.96*SE.rMAE1; rMAE1+1.96*SE.rMAE1

# Create a vector of errors for RMSE and MAE in (ii).
ISE.error2=vector(mode="numeric", length=526)
result.index=0
for(n in 11:536){
  result.index=result.index+1
  error2=ISE_data[n,2]-mean(ISE_data[c((n-5):(n-1)),2])
  ISE.error2[result.index]=error2
}
# Calculate RMSE for (ii).
(RMSE2=sqrt(mean(ISE.error2^2)))
# Calculate MAE for (ii).
(MAE2=mean(abs(ISE.error2)))
# Calculate standard error of RMSE for (ii).
(SE.RMSE2=(sd(ISE.error2^2)/sqrt(526))/(2*sqrt(mean(ISE.error2^2))))
# Calculate standard error of MAE for (ii).
(SE.MAE2=sd(abs(ISE.error2))/sqrt(526))
# 95% confidence interval for RMSE.
RMSE2-1.96*SE.RMSE2; RMSE2+1.96*SE.RMSE2
# 95% confidence interval for MAE.
MAE2-1.96*SE.MAE2; MAE2+1.96*SE.MAE2
# Create a vector of errors for relative RMSE and relative MAE in (ii).
ISE.rerror2=ISE.error2/ISE_data[c(11:536),2]
# Calculate relative RMSE for (ii).
(rRMSE2=sqrt(mean(ISE.rerror2^2)))
# Calculate relative MAE for (ii).
(rMAE2=mean(abs(ISE.rerror2)))
# Calculate standard error of relative RMSE for (ii).
(SE.rRMSE2=(sd(ISE.rerror2^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror2^2))))
# Calculate standard error of relative MAE for (ii).
(SE.rMAE2=sd(abs(ISE.rerror2))/sqrt(526))
# 95% confidence interval for relative RMSE.
rRMSE2-1.96*SE.rRMSE2; rRMSE2+1.96*SE.rRMSE2
# 95% confidence interval for relative MAE.
rMAE2-1.96*SE.rMAE2; rMAE2+1.96*SE.rMAE2

# Create a vector of errors for RMSE and MAE in (iii).
ISE_data.iii=ISE_data[-536,]
ISE_data.iii$ISE.predicted=ISE_data$ISE[2:536]
ISE.error3=vector(mode="numeric", length=526)
result.index=0
for(n in 10:535){
  result.index=result.index+1
  lmmodel3=lm(ISE.predicted~ISE+S.P.500+DAX+FTSE+NIKKEI+BOVESPA+MSCI.EU+MSCI.EM,
              data=ISE_data.iii[(n-9):(n-1),])
  error3=ISE_data.iii[n,10]-predict(lmmodel3, ISE_data.iii[n,])
  ISE.error3[result.index]=error3
}
# Calculate RMSE for (iii).
(RMSE3=sqrt(mean(ISE.error3^2)))
# Calculate MAE for (iii).
(MAE3=mean(abs(ISE.error3)))
# Calculate standard error of RMSE for (iii).
(SE.RMSE3=(sd(ISE.error3^2)/sqrt(526))/(2*sqrt(mean(ISE.error3^2))))
# Calculate standard error of MAE for (iii).
(SE.MAE3=sd(abs(ISE.error3))/sqrt(526))
# 95% confidence interval for RMSE.
RMSE3-1.96*SE.RMSE3; RMSE3+1.96*SE.RMSE3
# 95% confidence interval for MAE.
MAE3-1.96*SE.MAE3; MAE3+1.96*SE.MAE3
# Create a vector of errors for relative RMSE and relative MAE in (iii).
ISE.rerror3=ISE.error3/ISE_data.iii[c(10:535),10]
# Calculate relative RMSE for (iii).
(rRMSE3=sqrt(mean(ISE.rerror3^2)))
# Calculate relative MAE for (iii).
(rMAE3=mean(abs(ISE.rerror3)))
# Calculate standard error of relative RMSE for (iii).
(SE.rRMSE3=(sd(ISE.rerror3^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror3^2))))
# Calculate standard error of relative MAE for (iii).
(SE.rMAE3=sd(abs(ISE.rerror3))/sqrt(526))
# 95% confidence interval for relative RMSE.
rRMSE3-1.96*SE.rRMSE3; rRMSE3+1.96*SE.rRMSE3
# 95% confidence interval for relative MAE.
rMAE3-1.96*SE.rMAE3; rMAE3+1.96*SE.rMAE3
# Create a vector of errors for RMSE and MAE in (iv).
ISE_data.iv=ISE_data[-c(535,536),]
ISE_data.extracted=ISE_data[-c(1,536),-1]
ISE_data.iv=cbind(ISE_data.iv, ISE_data.extracted)
ISE_data.iv$ISE.predicted=ISE_data[-c(1,2),2]
names(ISE_data.iv)=c("date","ISE2","S.P.5002","DAX2","FTSE2",
                     "NIKKEI2","BOVESPA2","MSCI.EU2","MSCI.EM2",
                     "ISE1","S.P.5001","DAX1","FTSE1",
                     "NIKKEI1","BOVESPA1","MSCI.EU1","MSCI.EM1","ISE.predicted")
ISE.error4=vector(mode="numeric", length=526)
result.index=0
for(n in 9:534){
  result.index=result.index+1
  lmmodel4=lm(ISE.predicted~ISE2+S.P.5002+DAX2+FTSE2+NIKKEI2+BOVESPA2+MSCI.EU2+MSCI.EM2
              +ISE1+S.P.5001+DAX1+FTSE1+NIKKEI1+BOVESPA1+MSCI.EU1+MSCI.EM1,
              data=ISE_data.iv[(n-8):(n-1),])
  error4=ISE_data.iv[n,18]-predict(lmmodel4, ISE_data.iv[n,])
  ISE.error4[result.index]=error4
}
# Calculate RMSE for (iv).
(RMSE4=sqrt(mean(ISE.error4^2)))
# Calculate MAE for (iv).
(MAE4=mean(abs(ISE.error4)))
# Calculate standard error of RMSE for (iv).
(SE.RMSE4=(sd(ISE.error4^2)/sqrt(526))/(2*sqrt(mean(ISE.error4^2))))
# Calculate standard error of MAE for (iv).
(SE.MAE4=sd(abs(ISE.error4))/sqrt(526))
# 95% confidence interval for RMSE.
RMSE4-1.96*SE.RMSE4; RMSE4+1.96*SE.RMSE4
# 95% confidence interval for MAE.
MAE4-1.96*SE.MAE4; MAE4+1.96*SE.MAE4

# Create a vector of errors for relative RMSE and relative MAE in (iv).
ISE.rerror4=ISE.error4/ISE_data.iv[c(9:534),18]
# Calculate relative RMSE for (iv).
(rRMSE4=sqrt(mean(ISE.rerror4^2)))
# Calculate relative MAE for (iv).
(rMAE4=mean(abs(ISE.rerror4)))
# Calculate standard error of relative RMSE for (iv).
(SE.rRMSE4=(sd(ISE.rerror4^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror4^2))))
# Calculate standard error of relative MAE for (iv).
(SE.rMAE4=sd(abs(ISE.rerror4))/sqrt(526))
# 95% confidence interval for relative RMSE.
rRMSE4-1.96*SE.rRMSE4; rRMSE4+1.96*SE.rRMSE4
# 95% confidence interval for relative MAE.
rMAE4-1.96*SE.rMAE4; rMAE4+1.96*SE.rMAE4

# Wilcoxon tests to compare the 4 different methods.
wilcox.test(abs(ISE.error1), abs(ISE.error2), paired=TRUE)
wilcox.test(abs(ISE.error1), abs(ISE.error3), paired=TRUE)
wilcox.test(abs(ISE.error1), abs(ISE.error4), paired=TRUE)
wilcox.test(abs(ISE.error2), abs(ISE.error3), paired=TRUE)
wilcox.test(abs(ISE.error2), abs(ISE.error4), paired=TRUE)
wilcox.test(abs(ISE.error3), abs(ISE.error4), paired=TRUE)
# Part (e)-(c). Robust linear regression with Part (c) validation setups.
# ------------------------------
# Creating function for R(beta).
# ------------------------------
Rbeta=function(beta, covariates, observed){
  sum(abs(as.matrix(covariates)%*%matrix(beta)-observed))
}
# ---------------------------------------------------------
# Validation set-up (i). Chronological 80-20 split of data.
# ---------------------------------------------------------
# Prediction method (iv): Robust linear regression.
# -- Model
Chr.PartE=nlm(Rbeta, p=c(-1,-1,-1,-1,-1,-1,-1),
              observed=ISE_data$ISE[1:429], covariates=ISE_data[1:429,3:9])
# -- Predicted values
Chr.PartE.Pred=as.matrix(ISE_data[430:536,3:9]) %*% matrix(Chr.PartE$estimate)
# -- Error measures
Chr.PartE.rmse = rmse(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.rmseSE = rmseSE(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.rmse-1.96*Chr.PartE.rmseSE; Chr.PartE.rmse+1.96*Chr.PartE.rmseSE
Chr.PartE.mae = mae(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.maeSE = maeSE(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.mae-1.96*Chr.PartE.maeSE; Chr.PartE.mae+1.96*Chr.PartE.maeSE
Chr.PartE.RELrmse = RELrmse(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.RELrmseSE = RELrmseSE(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.RELrmse-1.96*Chr.PartE.RELrmseSE; Chr.PartE.RELrmse+1.96*Chr.PartE.RELrmseSE
Chr.PartE.RELmae = RELmae(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.RELmaeSE = RELmaeSE(ISE_data$ISE[c(430:536)], Chr.PartE.Pred)
Chr.PartE.RELmae-1.96*Chr.PartE.RELmaeSE; Chr.PartE.RELmae+1.96*Chr.PartE.RELmaeSE

# Comparison of prediction methods.
wilcox.test(abs(ISE_data$ISE[c(430:536)]-Chr.ISEmean),
            abs(ISE_data$ISE[c(430:536)]-Chr.PartE.Pred), paired=TRUE)
wilcox.test(abs(ISE_data$ISE[c(430:536)]-Chr.LMnoTime.Pred),
            abs(ISE_data$ISE[c(430:536)]-Chr.PartE.Pred), paired=TRUE)
wilcox.test(abs(ISE_data$ISE[c(430:536)]-Chr.LMwithTime.Pred),
            abs(ISE_data$ISE[c(430:536)]-Chr.PartE.Pred), paired=TRUE)

# ---------------------------------------------------
# Validation set-up (ii). Five-fold cross-validation.
# ---------------------------------------------------
# Prediction method (iv): Robust linear regression.
# -- Models
Fol.PartE=list()
for(i in c(1,3,5)){
  Fol.PartE[[i]]=nlm(Rbeta, p=c(-0.5,-0.5,-0.5,-0.5,-0.5,-0.5,-0.5),
                     observed=trainfolds[[i]]$ISE, covariates=trainfolds[[i]][c(3:9)])
}
for(i in c(2,4)){
  Fol.PartE[[i]]=nlm(Rbeta, p=c(-1,-1,-1,-1,-1,-1,-1),
                     observed=trainfolds[[i]]$ISE, covariates=trainfolds[[i]][c(3:9)])
}
# -- Predicted values
Fol.PartE.Pred=list()
for(i in 1:5){
  Fol.PartE.Pred[[i]]=as.matrix(testfolds[[i]][c(3:9)])%*%matrix(Fol.PartE[[i]]$estimate)
}
# -- Error measures
# *** RMSE ***
Fol.PartE.rmse=list()
for(i in 1:5){ Fol.PartE.rmse[[i]]=rmse(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.rmse=mean(as.numeric(Fol.PartE.rmse))
# Standard Error
Fol.PartE.rmseSE=list()
for(i in 1:5){ Fol.PartE.rmseSE[[i]]=rmseSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.rmseSE=mean(as.numeric(Fol.PartE.rmseSE))
# Confidence Interval
Fol.PartE.rmse-1.96*Fol.PartE.rmseSE; Fol.PartE.rmse+1.96*Fol.PartE.rmseSE
# *** MAE ***
Fol.PartE.mae=list()
for(i in 1:5){ Fol.PartE.mae[[i]]=mae(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.mae=mean(as.numeric(Fol.PartE.mae))
# Standard Error
Fol.PartE.maeSE=list()
for(i in 1:5){ Fol.PartE.maeSE[[i]]=maeSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.maeSE=mean(as.numeric(Fol.PartE.maeSE))
# Confidence Interval
Fol.PartE.mae-1.96*Fol.PartE.maeSE; Fol.PartE.mae+1.96*Fol.PartE.maeSE
# *** Relative RMSE ***
Fol.PartE.RELrmse=list()
for(i in 1:5){ Fol.PartE.RELrmse[[i]]=RELrmse(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.RELrmse=mean(as.numeric(Fol.PartE.RELrmse))
# Standard Error
Fol.PartE.RELrmseSE=list()
for(i in 1:5){ Fol.PartE.RELrmseSE[[i]]=RELrmseSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.RELrmseSE=mean(as.numeric(Fol.PartE.RELrmseSE))
# Confidence Interval
Fol.PartE.RELrmse-1.96*Fol.PartE.RELrmseSE; Fol.PartE.RELrmse+1.96*Fol.PartE.RELrmseSE
# *** Relative MAE ***
Fol.PartE.RELmae=list()
for(i in 1:5){ Fol.PartE.RELmae[[i]]=RELmae(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.RELmae=mean(as.numeric(Fol.PartE.RELmae))
# Standard Error
Fol.PartE.RELmaeSE=list()
for(i in 1:5){ Fol.PartE.RELmaeSE[[i]]=RELmaeSE(testfolds[[i]]$ISE, Fol.PartE.Pred[[i]]) }
Fol.PartE.RELmaeSE=mean(as.numeric(Fol.PartE.RELmaeSE))
# Confidence Interval
Fol.PartE.RELmae-1.96*Fol.PartE.RELmaeSE; Fol.PartE.RELmae+1.96*Fol.PartE.RELmaeSE

# Comparison of prediction methods.
# Vector of residuals for prediction method (iv).
Fol.PartE.resid=list()
for(i in 1:5){ Fol.PartE.resid[[i]]=testfolds[[i]]$ISE - Fol.PartE.Pred[[i]] }
Fol.PartE.resid=unlist(Fol.PartE.resid)
# Test for comparison of prediction methods.
wilcox.test(abs(Fol.ISEmean.resid), abs(Fol.PartE.resid), paired=TRUE)
wilcox.test(abs(Fol.LMnoTime.resid), abs(Fol.PartE.resid), paired=TRUE)
wilcox.test(abs(Fol.LMwithTime.resid), abs(Fol.PartE.resid), paired=TRUE)
# Part (e)-(d). Robust linear regression with Part (d) validation setup.
# Sum of absolute residuals, to be minimised over the coefficient vector be.
Sum.residuals=function(be, x, y){
  res=be%*%t(x)
  SAR=sum(abs(res-y))
  return(SAR)
}
# Create a vector of errors for RMSE and MAE over the 526 data splits.
ISE.error5=vector(mode="numeric", length=526)
result.index=0
for(n in 10:535){
  result.index=result.index+1
  beta=nlm(Sum.residuals, p=c(10,10,10,10,10,10,10,10),
           x=ISE_data.iii[(n-9):(n-1),-c(1,10)],
           y=ISE_data.iii$ISE.predicted[(n-9):(n-1)], iterlim=300)$estimate
  error5=ISE_data.iii$ISE.predicted[n]-beta%*%t(ISE_data.iii[n,2:9])
  ISE.error5[result.index]=error5
}
# Calculate RMSE.
(RMSE5=sqrt(mean(ISE.error5^2)))
# Calculate MAE.
(MAE5=mean(abs(ISE.error5)))
# Calculate standard error of RMSE.
(SE.RMSE5=(sd(ISE.error5^2)/sqrt(526))/(2*sqrt(mean(ISE.error5^2))))
# Calculate standard error of MAE.
(SE.MAE5=sd(abs(ISE.error5))/sqrt(526))
# 95% confidence interval for RMSE.
RMSE5-1.96*SE.RMSE5; RMSE5+1.96*SE.RMSE5
# 95% confidence interval for MAE.
MAE5-1.96*SE.MAE5; MAE5+1.96*SE.MAE5
# Create a vector of errors for relative RMSE and relative MAE.
ISE.rerror5=ISE.error5/ISE_data[c(11:536),2]
# Calculate relative RMSE.
(rRMSE5=sqrt(mean(ISE.rerror5^2)))
# Calculate relative MAE.
(rMAE5=mean(abs(ISE.rerror5)))
# Calculate standard error of relative RMSE.
(SE.rRMSE5=(sd(ISE.rerror5^2)/sqrt(526))/(2*sqrt(mean(ISE.rerror5^2))))
# Calculate standard error of relative MAE.
(SE.rMAE5=sd(abs(ISE.rerror5))/sqrt(526))
# 95% confidence interval for relative RMSE.
rRMSE5-1.96*SE.rRMSE5; rRMSE5+1.96*SE.rRMSE5
# 95% confidence interval for relative MAE.
rMAE5-1.96*SE.rMAE5; rMAE5+1.96*SE.rMAE5
# Wilcoxon tests to compare the method of part (e) with the 4 methods from part (d).
wilcox.test(abs(ISE.error5), abs(ISE.error1), paired=TRUE)
wilcox.test(abs(ISE.error5), abs(ISE.error2), paired=TRUE)
wilcox.test(abs(ISE.error5), abs(ISE.error3), paired=TRUE)
wilcox.test(abs(ISE.error5), abs(ISE.error4), paired=TRUE)
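The nlm() calls in parts (e)-(c) and (e)-(d) minimise a sum of absolute residuals, i.e. a least-absolute-deviations (L1) criterion. A minimal toy sketch (simulated data, assumed purely for illustration) of why such a fit is less sensitive to outliers than least squares:

```r
# Toy illustration (simulated data): L1 regression via nlm() vs least squares.
set.seed(1)
x <- matrix(rnorm(50), ncol=1)
y <- 2*x[,1] + rnorm(50, sd=0.1)
y[1] <- 50                                         # one gross outlier
sar <- function(b, x, y){ sum(abs(x %*% b - y)) }  # same criterion as Rbeta / Sum.residuals
b.l1 <- nlm(sar, p=0, x=x, y=y)$estimate           # tends to stay near the true slope 2
b.l2 <- coef(lm(y ~ x - 1))                        # least squares; pulled towards the outlier
```

Note that nlm() uses gradient-based optimisation, so minimising the non-smooth L1 criterion is only approximate; as seen in the cross-validation code above, the result can depend on the starting value p.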
Appendix B: Task 2 SAS Code

libname cps "C:/Users/User/Documents/STAT7001/cps";
data cps.Rd;
input R d;
datalines;
0.00093 0.2588
0.00148 0.2053
0.0024 0.1628
0.0037 0.1291
0.0059 0.1024
0.0095 0.08118
0.0150 0.06438
0.024 0.05106
0.038 0.04049
0.048 0.03606
0.061 0.03211
0.096 0.02546
0.153 0.02019
0.24 0.01601
0.39 0.01270
0.98 0.00799
run;
proc print; run;

*TASK 2 (a);
*Setting the font size to 12pt;
goptions device=gif hsize=4in vsize=3in border ftext="sasfont" htext=12pt;
proc univariate data=cps.Rd;
var R d;
histogram;
qqplot / normal(mu=est sigma=est);
run;
title;
title2 "Resistance versus diameter";
symbol1 value=plus color=red;
axis1 label=("Diameter (cm)");
axis2 label=(angle=90 "Resistance (Ohm)");
proc gplot data=cps.Rd;
plot R*d /haxis=axis1 vaxis=axis2;
run;
proc reg data=cps.Rd;
model R=d;
run;

*TASK 2 (b);
data cps.Rd2;
set cps.Rd;
logR=log(R);
recd2=1/(d**2);
recd=1/d;
logd=log(d);
d2=d**2;
d3=d**3;
d4=d**4;
d5=d**5;
d6=d**6;
d7=d**7;
d8=d**8;
d9=d**9;
d10=d**10;
d11=d**11;
d12=d**12;
d13=d**13;
d14=d**14;
d15=d**15;
run;
proc print; run;

*Suggested model i;
proc reg data=cps.Rd2;
model logR = logd;
run;
*TASK 2 (b) i;
proc reg data=cps.Rd2;
model R=d d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15;
run;
*TASK 2 (b) ii;
proc reg data=cps.Rd2;
model R = recd2 recd;
run;
*Suggested model ii;
proc reg data=cps.Rd2;
model R = recd2;
run;
proc corr plots=(matrix);
with R logR;
var recd recd2 logd d d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15;
run;

*LEAVE-ONE-OUT CROSS VALIDATION;
*Suggested model i: logR = logd;
*Generate the cross validation data;
data cps.cv4;
do replicate = 1 to datasize;
  do rec = 1 to datasize;
    set cps.Rd2 nobs=datasize point=rec;
    if rec ^= replicate then new_R=logR;
    else new_R=.;
    output;
  end;
end;
stop;
run;
proc print; run;
*get predicted values for the missing new_R in each replicate;
proc reg data=cps.cv4;
model new_R=logd;
by replicate;
output out=out4a(where=(new_R=.)) predicted=R_hat;
run;
proc print; run;
*and summarize the results;
data cps.out4b;
set out4a;
diff=logR-R_hat;
absd=abs(diff);
run;
title;
title2 "Residual Plot for Model logR = logd";
symbol1 value=plus color=red;
axis1 label=("logR");
axis2 label=(angle=90 "Residual");
proc gplot data=cps.out4b;
plot diff*logR /haxis=axis1 vaxis=axis2;
run;
proc summary data=cps.out4b;
var diff absd;
output out=out4c std(diff)=rmse mean(absd)=mae std(absd)=c;
run;
proc print; run;
data out4d;
set cps.out4b;
diff2=diff**2;
mse=0.009292428**2;
a=(diff2-mse)**2;
run;
proc summary data=out4d;
var a;
output out=out4e sum(a)=b;
run;
data out4f;
set out4e;
seRMSE=((b**0.5)/16)/(2*0.009292428);
seMAE=.006464840/4;
run;
proc print; run;

*2(b)i model: R=d d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15;
*Generate the cross validation data;
data cps.cv2;
do replicate = 1 to datasize;
  do rec = 1 to datasize;
    set cps.Rd2 nobs=datasize point=rec;
    if rec ^= replicate then new_R=R;
    else new_R=.;
    output;
  end;
end;
stop;
run;
proc print; run;
*get predicted values for the missing new_R in each replicate;
proc reg data=cps.cv2;
model new_R=d d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15;
by replicate;
output out=out2a(where=(new_R=.)) predicted=R_hat;
run;
*and summarize the results;
data cps.out2b;
set out2a;
diff=R-R_hat;
absd=abs(diff);
run;
title;
title2 "Residual Plot for Model R=d d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15";
symbol1 value=plus color=red;
axis1 label=("R");
axis2 label=(angle=90 "Residual");
proc gplot data=cps.out2b;
plot diff*R /haxis=axis1 vaxis=axis2;
run;
proc summary data=cps.out2b;
var diff absd;
output out=out2c std(diff)=rmse mean(absd)=mae std(absd)=c;
run;
proc print; run;
data out2d;
set cps.out2b;
diff2=diff**2;
mse=5358.40**2;
a=(diff2-mse)**2;
run;
proc summary data=out2d;
var a;
output out=out2e sum(a)=b;
run;
data out2f;
set out2e;
seRMSE=((b**0.5)/16)/(2*5358.40);
seMAE=5357.98/4;
run;
proc print; run;

*2(b)ii model: R = recd2 recd;
*Generate the cross validation data;
data cps.cv3;
do replicate = 1 to datasize;
  do rec = 1 to datasize;
    set cps.Rd2 nobs=datasize point=rec;
    if rec ^= replicate then new_R=R;
    else new_R=.;
    output;
  end;
end;
stop;
run;
proc print; run;
*get predicted values for the missing new_R in each replicate;
proc reg data=cps.cv3;
model new_R=recd2 recd;
by replicate;
output out=out3a(where=(new_R=.)) predicted=R_hat;
run;
proc print; run;
*and summarize the results;
data cps.out3b;
set out3a;
diff=R-R_hat;
absd=abs(diff);
run;
title;
title2 "Residual Plot for Model R = recd2 recd";
symbol1 value=plus color=red;
axis1 label=("R");
axis2 label=(angle=90 "Residual");
proc gplot data=cps.out3b;
plot diff*R /haxis=axis1 vaxis=axis2;
run;
proc print; run;
proc summary data=cps.out3b;
var diff absd;
output out=out3c std(diff)=rmse mean(absd)=mae std(absd)=c;
run;
proc print; run;
data out3d;
set cps.out3b;
diff2=diff**2;
mse=0.001920634**2;
a=(diff2-mse)**2;
run;
proc summary data=out3d;
var a;
output out=out3e sum(a)=b;
run;
data out3f;
set out3e;
seRMSE=((b**0.5)/16)/(2*.001920634);
seMAE=.001679167/4;
run;
proc print; run;

*Suggested model ii: R = recd2;
*Generate the cross validation data;
data cps.cv5;
do replicate = 1 to datasize;
  do rec = 1 to datasize;
    set cps.Rd2 nobs=datasize point=rec;
    if rec ^= replicate then new_R=R;
    else new_R=.;
    output;
  end;
end;
stop;
run;
proc print; run;
*get predicted values for the missing new_R in each replicate;
proc reg data=cps.cv5;
model new_R=recd2;
by replicate;
output out=out5a(where=(new_R=.)) predicted=R_hat;
run;
proc print; run;
*and summarize the results;
data cps.out5b;
set out5a;
diff=R-R_hat;
absd=abs(diff);
run;
title;
title2 "Residual Plot for Model R = recd2";
symbol1 value=plus color=red;
axis1 label=("R");
axis2 label=(angle=90 "Residual");
proc gplot data=cps.out5b;
plot diff*R /haxis=axis1 vaxis=axis2;
run;
proc print; run;
proc summary data=cps.out5b;
var diff absd;
output out=out5c std(diff)=rmse mean(absd)=mae std(absd)=c;
run;
proc print; run;
data out5d;
set cps.out5b;
diff2=diff**2;
mse=0.001314149**2;
a=(diff2-mse)**2;
run;
proc summary data=out5d;
var a;
output out=out5e sum(a)=b;
run;
data out5f;
set out5e;
seRMSE=((b**0.5)/16)/(2*0.001314149);
seMAE=.001131049/4;
run;
proc print; run;

*Producing a table containing absd from all 4 models to carry out the paired Wilcoxon signed rank test;
proc sql;
select A.absd, B.absd, C.absd, D.absd
from cps.out4b as A, cps.out2b as B, cps.out3b as C, cps.out5b as D
where A.replicate=B.replicate and B.replicate=C.replicate and C.replicate=D.replicate;
quit;
data cps.absd;
input model1 model2 model3 model4;
datalines;
0.002509 21432.82 0.000178 0.000222
0.000547 8.835317 0.000144 0.000221
0.022544 3.601289 0.000052 0.000269
0.013488 0.548649 0.000112 0.000167
0.009526 0.158066 0.000069 0.000153
0.003606 0.060452 0.000078 0.000232
0.003878 0.043877 0.000043 0.000121
0.003001 0.016238 0.000236 0.000225
0.001622 0.024246 0.000159 0.000046
0.0004 0.023091 0.000266 0.000096
0.008668 0.009097 0.000809 0.00056
0.00284 0.031168 0.000013 0.000341
0.000346 0.043857 0.000134 0.000307
0.016335 0.003074 0.004498 0.004225
0.010083 0.072958 0.003586 0.002621
0.004023 0.278421 0.004755 0.000581
run;
data cps.diff;
set cps.absd;
AB=model1-model2;
AC=model1-model3;
AD=model1-model4;
BC=model2-model3;
BD=model2-model4;
CD=model3-model4;
run;
proc univariate data=cps.diff;
var AB AC AD BC BD CD;
run;
Appendix C: References

1. Jeff Cartier. The Basics of Creating Graphs with SAS/GRAPH® Software. [online]. Available from: https://support.sas.com/rnd/datavisualization/papers/GraphBasics.pdf [Accessed 24 February 2016]
2. Steven M. LaLonde. 2012. Transforming Variables for Normality and Linearity – When, How, Why and Why Not's. [online]. Available from: http://support.sas.com/resources/papers/proceedings12/430-2012.pdf [Accessed 13 March 2016]
3. David L. Cassell. 2007. Don't Be Loopy: Re-Sampling and Simulation the SAS® Way. [online]. Available from: http://www2.sas.com/proceedings/forum2007/183-2007.pdf [Accessed 14 March 2016]
4. Michael J. Wieczkowski. Alternatives to Merging SAS Data Sets … But Be Careful. [online]. Available from: http://www.ats.ucla.edu/stat/sas/library/nesug99/bt150.pdf [Accessed 23 March 2016]
