Annual IL Tornado Count
Katie Ruben
April 22, 2016
As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a
cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In
addition, for this violently rotating column of air to be considered a tornado, the column must
make contact with the ground. When forecasting tornados, meteorologists look for four
ingredients in predicting such severe weather. These ingredients are present when the “temperature and
wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for
a tornadic thunderstorm to occur [1].”
Tornados are an important aspect of living in certain areas of the country. They can cause death,
injury, property damage, and high anxiety in many people who choose to live in areas prone
to tornados. This project looks at the number of annual tornados that have occurred in Illinois
since 1950. Meteorologists are interested in improving their understanding of the causes of
tornados as well as when they will occur. The data used in this analysis comes from the National
Oceanic and Atmospheric Administration [2]. The data set contains a tornado count from 1950 to
2015 for every state in the United States. I chose to look strictly at Illinois in this date range,
which yields 2,406 tornados over those 66 years. In particular, I am interested in forecasting the
number of tornados that will occur in subsequent years based on this time series data.
To analyze the Illinois tornado count time series, I will first check whether the data are
stationary or non-stationary by applying the Dickey Fuller test. The outcome will help determine
which set of time series models to continue with. Depending on the original data set, I may need
to perform transformations that reduce large variance and remove any explosive behavior in the
data. I will then generate preliminary models using the ACF, PACF, EACF, and an ARMA subset
selection. Once several potential models have been chosen, I will fit them by estimating
parameters with the maximum likelihood method. In addition, I will perform a residual analysis
on the fitted models to verify, to the best of my ability, that the residuals are normally
distributed, independent, and have constant variance. To do so, I will use the KS test, SW test,
and QQ plot for normality, the runs test and sample autocorrelation function for independence,
and the BP test for constant variance. I will continue building an appropriate model by checking
for outliers and adjusting the models based on the residual analysis. The final step is to forecast
the series into the future and compare the forecasts with the actual data to assess how accurate
the model has become.
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)
1   Background
Tornados are an important aspect of living in certain areas of the country. They can cause death,
injury, property damage, and high anxiety in many people who choose to live in areas prone
to tornados. In particular, this project looks at the number of annual tornados that have occurred
in Illinois since 1950. Meteorologists are interested in improving their understanding of the
causes of tornados as well as when they will occur. The data used in this analysis comes from the
National Oceanic and Atmospheric Administration [2]. The data being investigated contain a
tornado count from 1950 to 2015 for every state in the United States. I chose to look strictly at
Illinois in this date range, which yields 2,406 tornados over those 66 years. In particular, I am
interested in forecasting the number of tornados that will occur in subsequent years based on the
time series data.
As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a
cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In
addition, for this violently rotating column of air to be considered a tornado, the column must
make contact with the ground. When forecasting tornados, meteorologists look for four
ingredients in predicting such severe weather. These ingredients are present when the “temperature and
wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for
a tornadic thunderstorm to occur [1].”
Prior to starting the time series analysis, we split the 66 years of observations into two
sets: training data and validation data. The training data set contains 60 years (1950-2009),
while the validation data set contains 6 years (2010-2015). The validation set thus makes up
about 9% of the total observed years. Keep in mind that over these 66 years there were 2,406
tornado sightings in Illinois.
In this paper, we begin by performing preliminary transformations on the training set to ensure
stationarity. If the data show non-stationary behavior, several different transformations are
worked through in section 2 of this paper. Section 2 also contains the model identification
process for several time series models, as well as estimation and residual analysis. Since the
training data contains 60 observations, the ideal maximum lag recommended by the
autocorrelation of residuals is k = ln(60) ≈ 4. This will become important as we work through
this data set. Section 3 focuses on model validation, choosing which of our models is most
accurate, and forecasting. Section 4 contains a discussion of our results for the IL tornado
counts from 1950 to 2015.
2   Training Data Transformations
2.1.1 Training Data
To begin the model building process, we start by examining the training data set. The time series
plot of the Illinois annual tornado count is shown in Figure 1. The plot suggests extremely large
variance as well as explosive behavior as time passes, which indicates that the time series is
non-stationary. However, we need to conduct some formal testing.
Figure 1: Training Data Time Series & Scatter Plot
The Dickey Fuller test is used to determine whether the data set is stationary or non-stationary.
The null hypothesis states that α = 1, i.e., there is a unit root and the time series is
non-stationary. The alternative hypothesis states that α < 1, i.e., the time series is stationary.
If the time series is non-stationary, it is suggested to take the first difference. Throughout this
paper, we work with a significance level of .05.
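A brief sketch of how this test can be run in R with the fUnitRoots package, shown here on the training series t.data (the appendix code applies the same calls to the differenced log series):

library(fUnitRoots)
# Augmented Dickey-Fuller test; the null hypothesis is a unit root (non-stationarity)
adfTest(t.data, lags = 1, type = "nc")   # no constant
adfTest(t.data, lags = 1, type = "c")    # constant
adfTest(t.data, lags = 1, type = "ct")   # constant and linear trend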
2.1.2   Transformed Training Data
2.1.2.1 Log(Training Data)
Before we check for stationarity, we need to try to eliminate the large variance seen in the data.
To do so, we take the natural logarithm of the training time series. The plot is shown in Figure 2.
As seen in Figure 2, there is still large variance as well as explosive behavior. It is suggested to
examine a Box-Cox transformation to see whether a better representation of the data set exists.
However, when using the Box-Cox procedure in R, we obtain λ = 0, which corresponds to the
logarithmic transformation; this suggests that no further transformation beyond the log is needed.
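A minimal sketch of this check, assuming the TSA package's BoxCox.ar function is applied to the (strictly positive) training counts:

library(TSA)
# Profile likelihood over the Box-Cox parameter lambda;
# an estimate/interval near 0 supports the log transformation
bc <- BoxCox.ar(t.data)
bc$mle   # maximum likelihood estimate of lambda
bc$ci    # approximate 95% confidence interval for lambda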
Figure 2: Natural Logarithm Transformation Training Data Plot
2.1.2.2 Difference on Logarithm of Preliminary Data
The final transformation used to attempt to remove the explosive behavior is to difference the
training data. As seen in Figure 3, the time series plot with this transformation looks much better.
The explosive behavior has dissipated. Looking at the scatter plots of this transformed data in
Figure 4, we see that Y_t vs. Y_{t-1} shows a negative correlation, Y_t vs. Y_{t-2} shows either a
slight negative correlation or no correlation, and Y_t vs. Y_{t-3} shows no correlation.
Investigation of these plots suggests that we may have a time series model of order 1. We will
conduct formal model selection next.
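The differenced log series and its lag scatter plots can be produced along the lines of this sketch, which mirrors the appendix code (zlag is TSA's lag helper):

w <- diff(log(t.data))                  # first difference of the log counts
plot(w, type = 'o', main = 'Differenced log series')
# Lag-1, lag-2, and lag-3 scatter plots (Figure 4)
plot(y = w, x = zlag(w),        main = 'Lag 1')
plot(y = w, x = zlag(w, d = 2), main = 'Lag 2')
plot(y = w, x = zlag(w, d = 3), main = 'Lag 3')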
Figure 3: Difference Training Data Time Series Plots
Figure 4: Difference Training Data Scatter Plots
There no longer seems to be apparent explosive behavior in the time series plot after taking the
difference of the log transform, which suggests stationarity in the transformed training data.
However, a formal Dickey Fuller test must be applied. In doing so, we get a p-value = .01 < .05 = α
at lags 1 and 2 for the non-constant, constant, and linear trend cases. Since the p-value is less
than the significance level, we reject the null hypothesis and conclude that the differenced series
is stationary. A sample R output is shown in Appendix A.
2.1.3 Preliminary Model Building
Now that we have figured out how to eliminate the explosive behavior in our data set, we can
begin to look for preliminary models from which to build an appropriate time series model. To do
this, we will look at the ACF plot for potential moving average models, the PACF plot for
potential autoregressive models, the EACF chart for potential mixed-process models, and ARMA
subset selection. Figure 5 shows the ACF, PACF, and EACF output for the differenced series chosen
above.
Figure 5: ACF, PACF, & EACF
The ACF plot suggests that a moving average model of order 1 may be a potential model for our
data set. The PACF plot suggests that an autoregressive model of order 1 may be a potential model.
One aspect to keep in mind is that, theoretically, an autoregressive process should produce an
ACF whose lags decay exponentially; the sample ACF for this data set does not follow such a
decaying pattern, so an autoregressive model may not be the best suited model. The EACF chart
again suggests an AR(1). We can also identify potential models by looking at the ARMA subset
selection based on BIC or AIC values; this output is displayed in Figure 6. The maximum number of
lags allowed is k = ln(60) ≈ 4, based on the autocorrelation-of-residuals recommendation from the
literature on the topic. This output suggests that the best model for the data would be an MA(1)
with an intercept term, which matches the suggestion made by the ACF plot. The second best
suggestion would be an ARMA(1,1) process. Throughout the rest of this project, I will work with
the following processes: ARI(1,1), IMA(1,1), and ARIMA(1,1,1).
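A sketch of the identification step, mirroring the appendix code; nar and nma are capped at 4 per the lag recommendation above:

w <- diff(log(t.data))
acf(w)      # suggests MA(1)
pacf(w)     # suggests AR(1)
eacf(w)     # candidate ARMA(p, q) orders
sub1 <- armasubsets(w, nar = 4, nma = 4, y.name = 'test', ar.method = 'ols')
plot(sub1)  # BIC-based subset selection (Figure 6)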
Figure 6: ARMA Subset BIC
2.1.3.1 Estimations
Using maximum likelihood estimation, we were able to come up with suitable models for
ARI(1,1), IMA(1,1), and ARIMA(1,1,1). When working with time series models, it is best to
choose a simple model to explain the data. Figure 7 displays the estimates for each of these
models. Note that the intercept terms were not significant at the .05 level, so they do not need to
be included in the models. We determined significance by examining the R output for the
estimates: the ratio of the intercept coefficient to its standard error was close to zero compared
with the critical value of 1.96, indicating that the intercept was not significantly different from
zero. Note that the models were estimated using the log data.
ARI(1,1): Y_t = -0.4447 Y_{t-1} + e_t
IMA(1,1): Y_t = e_t - (-0.5491) e_{t-1}
ARIMA(1,1,1): Y_t = 0.3370 Y_{t-1} + e_t - (-0.8658) e_{t-1}
(Here Y_t denotes the differenced log series, since each model includes one order of differencing.)
Figure 7: Model Estimates
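A minimal sketch of these fits and of the significance check described above (the calls are simplified from the appendix code; the ratio of each estimate to its standard error is compared with 1.96):

AR1    <- arima(log(t.data), order = c(1, 1, 0), method = 'ML')   # ARI(1,1)
MA1    <- arima(log(t.data), order = c(0, 1, 1), method = 'ML')   # IMA(1,1)
ARMA11 <- arima(log(t.data), order = c(1, 1, 1), method = 'ML')   # ARIMA(1,1,1)
# |estimate / s.e.| < 1.96 suggests a coefficient is not significantly different from zero
AR1$coef / sqrt(diag(AR1$var.coef))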
2.1.3.1.1 Outliers
Before proceeding further, we must determine whether outliers exist for each of our potential
models. In R, we ran the additive outlier and innovational outlier commands (detectAO and
detectIO). For each model, both commands confirmed that there are no outliers in any of the three
models. Therefore, we can continue with the residual analysis.
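These checks follow the appendix code; a sketch for one model looks like:

detectAO(ARMA11)   # additive outliers
detectIO(ARMA11)   # innovational outliers
# Repeated for AR1 and MA1 as well; no outliers were flagged for any model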
2.1.3.2 Residual Analysis
The next step is to look at the residuals of our three models. From the residuals, we can assess
normality, constant error variance, and independence. Note that the original training data showed
large variance; the transformations fixed the non-stationarity, but we suspect the residuals may
still show large variance, and therefore possibly non-normality and dependence as well. However,
we will conduct formal tests on all three models for each of these characteristics.
As seen in Figure 8, the QQ plots do not suggest strong normality. All three models show somewhat
heavy tails, and the QQ line does not align with the data points as well as we would wish. In our
opinion, the ARI(1,1) model has the best-looking QQ plot for normality. To verify this conclusion,
we conduct a KS test and a Shapiro-Wilk test, which can be found in Appendix A. With a
significance level of .05, we fail to reject the null hypothesis in every normality test (KS and
Shapiro-Wilk) for each of the three models; in each case the p-value is greater than the
significance level. This means we can assume that the residuals are from the normal distribution.
Figure 8: QQ Plots of Models
Next we look at constant error variance, shown in Figure 9. In all three plots there appears to be
large variance around the horizontal line y = 0; however, the plot for each model does resemble
white noise, so we can assume possibly constant variance for each model. I was unable to perform
a BP or BF test on this data because I did not have the necessary x-variable to regress the
residuals on, and hence R would not produce these tests.
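A sketch of these diagnostics for one model, assuming the fitted objects AR1, MA1, and ARMA11 from the previous section (the full set of calls is in the appendix code):

r <- residuals(MA1)
qqnorm(r); qqline(r, col = 'red')    # QQ plot (Figure 8)
ks.test(r, "pnorm")                  # Kolmogorov-Smirnov test of normality
shapiro.test(r)                      # Shapiro-Wilk test of normality
plot(rstandard(MA1), type = 'o')     # standardized residuals (Figure 9)
abline(h = 0, col = 'red')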
Figure 9: Error Variance Analysis of Models
Finally, we check whether the residuals of our three models are independent. To do this, we use a
runs test for each model. Based on the runs tests shown in Appendix A, we conclude that the
residuals can be assumed independent: the p-value is greater than the significance level of .05
for each model, so we fail to reject the null hypothesis that the data are independent. Another
way to test for independence is the Ljung-Box test, whose null hypothesis is that the data are
independently distributed; in other words, it tests whether the residual autocorrelations are
jointly different from zero. Based on the results in Appendix A, we fail to reject the null
hypothesis: the p-value is greater than the significance level for each of the three models.
Finally, to confirm independence once more, we can look at the ACF plot of the residuals for each
model, shown in Figure 10. Since the lags all fall within the blue cutoff lines, we assume that the
residuals resemble white noise and are therefore independent.
Figure 10: ACF Residuals
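The independence checks can be reproduced with a sketch like the following (runs is TSA's runs test; fitdf in Box.test accounts for the number of estimated ARMA coefficients):

runs(residuals(MA1))                                                 # runs test
Box.test(residuals(MA1), lag = 4, type = "Ljung-Box", fitdf = 1)     # Ljung-Box test
acf(residuals(MA1))                                                  # residual ACF (Figure 10)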
Therefore, we have been able to transform the training data into a series whose residuals show
normality, constant variance, and independence. This is normally not an easy task, but with a data
set of only 66 annual observations it was achievable.
3 Model Validation
3.1.1 Confirmation of Models (Overfitting & Parameter Redundancy)
Now we will confirm that the three suggested models are good models for our data set by extending
the parameters of each (overfitting). If the estimate of the additional parameter is not
significantly different from zero, and the estimates of the original model do not change
significantly from their original values, then we can confirm that the model is a good fit. We
again use a significance level of .05: if the ratio of an estimated coefficient to its standard
error is less than the critical value of 1.96 in absolute value, we conclude that the coefficient
is not significantly different from zero.
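A sketch of this check for one pair of models, ARI(1,1) against the overfit ARI(2,1); the added coefficient's estimate-to-standard-error ratio is compared with 1.96, and the AIC values are compared as well:

AR1 <- arima(log(t.data), order = c(1, 1, 0), method = 'ML')
AR2 <- arima(log(t.data), order = c(2, 1, 0), method = 'ML')     # overfit by one AR term
ratio <- AR2$coef["ar2"] / sqrt(AR2$var.coef["ar2", "ar2"])
abs(ratio) < 1.96        # TRUE per the table below: the extra term is not significant
c(AIC(AR1), AIC(AR2))    # the overfit model also has the larger AIC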
Model        ARI(1,1)            ARI(2,1)
φ1 (s.e.)    -0.4447 (0.1155)    -0.4970 (0.1298)
φ2 (s.e.)           -            -0.1124 (0.1298)
AIC          125.67              126.93

Significance of the added term: φ2 ratio = -0.1124 / 0.1298 = -0.8659, |ratio| < 1.96, therefore not significant.

Model        IMA(1,1)            IMA(1,2)
θ1 (s.e.)    -0.5491 (0.1358)    -0.5527 (0.1352)
θ2 (s.e.)           -            -0.0475 (0.1728)
AIC          124.04              125.97

Significance of the added term: θ2 ratio = -0.0475 / 0.1728 = -0.2748, |ratio| < 1.96, therefore not significant.

Model        ARIMA(1,1,1)        ARIMA(1,1,2)        ARIMA(2,1,1)
φ1 (s.e.)    0.3370 (0.1698)     -0.9009 (0.2298)    0.3189 (0.1461)
θ1 (s.e.)    -0.8658 (0.1027)    0.3474 (0.2712)     -0.9114 (0.0714)
θ2 (s.e.)           -            -0.4464 (0.2057)           -
φ2 (s.e.)           -                   -            0.2174 (0.1388)
AIC          124.97              127.71              124.52

Significance of the added terms: θ2 ratio = -0.4464 / 0.2057 = -2.1701, |ratio| > 1.96, therefore significant; φ2 ratio = 0.2174 / 0.1388 = 1.566, |ratio| < 1.96, therefore not significant.
If we continue to increase the order of the ARIMA model, the AIC value continues to get larger.
Generally, a smaller AIC value indicates a better model, so ARIMA(1,1,1) appears to be the best
mixed-process model based on AIC.
In addition, when applying the overfitting procedure to our three suggested models, ARI(1,1) and
IMA(1,1) are confirmed to be good models for this time series. We were unable to confirm
ARIMA(1,1,1) as a good model because ARIMA(1,1,2) has a significant coefficient for θ2, and the
ARIMA estimates under overfitting were not close to the original estimates for this model.
Therefore, we cannot confirm ARIMA(1,1,1) as a good fit for our data.
As we continue with forecasting, we will therefore consider only the ARI(1,1) and IMA(1,1) models
for the log data.
3.1.2 Forecasting
The final step is to identify which of the remaining models is the better predictor of annual
tornados. We forecast values for the validation (testing) data set: recall that we initially set
aside the last 6 years of the data. Now we can test how accurate each model is. As seen in
Figure 11, the predictions made in R are displayed as red dots. These predictions were made using
the one-step-ahead forecasting procedure discussed in class. As you can tell, the forecasts are
not perfect. This is partly because the data set contains only 66 points, and these points may
need to be modeled using a different technique; however, we are using the techniques demonstrated
in class.
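A sketch of the rolling one-step-ahead forecasts on the log scale, mirroring the appendix code (y1 holds all 66 annual counts); shown here for IMA(1,1), with order = c(1, 1, 0) giving the ARI(1,1) version:

pred <- rep(NA, 6)
for (i in 1:6) {
  fit     <- arima(log(y1[1:(61 + i - 1)]), order = c(0, 1, 1), method = 'ML')  # expanding window
  pred[i] <- predict(fit, n.ahead = 1)$pred                                     # one step ahead
}
plot(ts(log(y1)), type = 'o', main = 'Forecasting with IMA(1,1)')
points(ts(pred, start = 61), col = 'red')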
Figure 11: Log data Forecast
In order to determine which model has better prediction capabilities, we will look at MSE, MAP,
and PMAD. The smaller the values, the better the prediction abilities. Therefore, by looking at
the table below we can determine that IMA(1,1) would have the best predicting capabilities
based on this recommendation.
Metric   ARI(1,1)      IMA(1,1)
MSE      0.0434644     0.03533288
MAP      0.0450561     0.03748878
PMAD     0.04383421    0.03675122

Smaller values indicate better predictive ability.
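These error measures are computed on the log scale; a minimal sketch, where pred holds the six one-step-ahead forecasts and y_test the six held-out counts (as in the appendix code):

MSE  <- mean((log(y_test) - pred)^2)                           # mean squared error
MAP  <- mean(abs((log(y_test) - pred) / log(y_test)))          # mean absolute percentage error
PMAD <- sum(abs(log(y_test) - pred)) / sum(abs(log(y_test)))   # percent mean absolute deviation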
As you can see, the values are not as small as we would like; however, with the data at hand, this
seems reasonably good. All code used to produce the predictions is shown in Appendix A.
Now that we have chosen IMA(1,1) as our best model, we transform the data back to its original
scale. Figure 12 shows the predicted values and prediction intervals on the log scale, while
Figure 13 shows the predictions and intervals after transforming back to the original scale.
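A sketch of the back-transformation, following the appendix code: because the model is fit to log counts, the point forecasts are converted with the lognormal mean correction exp(forecast + se^2/2), and the interval endpoints are exponentiated. Here pred and se stand for the one-step-ahead forecasts and their standard errors (pred1 and pred4 in the appendix):

point <- exp(pred + 0.5 * se^2)            # forecasts on the original count scale
lower <- exp(pred - qnorm(0.975) * se)     # 95% prediction interval, lower bound
upper <- exp(pred + qnorm(0.975) * se)     # 95% prediction interval, upper bound
data.frame(Year = 2010:2015, point, lower, upper)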
Figure 12: Log data Prediction Values & Interval
Figure 13: Original data Prediction Values & Interval
Figure 13 shows that three of our 6 values were predicted fairly close to the actual values.
However, there is still a lot of variability in the model's predictions. The prediction intervals
for the original data set, after transforming back, are very wide, which means the predictive
capability of IMA(1,1) is not especially good. The complete list of 95% prediction intervals for
the original data set is given in Appendix A. Figure 14 displays the final graph of the time
series data, comparing the original data to the 6 predicted values.
Figure 14: Time Series Plot with Predictions (Original Data)
IMA(1,1): Y_t = e_t - (-0.5491) e_{t-1}
4 Discussion
The goal of this analysis was to use knowledge of time series models to predict the future annual
count of tornados in Illinois. We started with 2,406 tornado sightings in Illinois from 1950 to
2015, which we aggregated into yearly counts, giving 66 data points. This was a
fairly small data set to perform a Time Series analysis on. Keep this in mind as we continue to
discuss our results.
We began our analysis with a training data set containing the first 60 years of observations,
setting aside the last 6 years as a testing data set, which became important when we performed our
forecasts. Our first goal was to ensure that the data were stationary. To do this, we applied the
log transformation and then differenced the data: the log transformation reduced the variability
seen in the original data set, while differencing removed the explosive behavior seen in the
original time series plot. We confirmed stationarity of the transformed series with the Dickey
Fuller test.
Once we had a stationary data set, we were able to begin the estimation process. We looked at
the ACF, PACF, EACF, and best subset selection chart in order to determine which models
would be best. We came to the conclusion that an ARI(1,1), IMA(1,1) and ARIMA(1,1,1) would
all be suitable models at this point. Next, we performed a residual analysis of all three models.
As discussed in the paper, the residuals of all three models were shown to be normal, to have
constant error variance, and to be independent. This is primarily due to the small sample size of
our data set; when a data set is small or extremely large, these three characteristics are a lot
easier to achieve. However, when we performed overfitting on all three of these models,
ARIMA(1,1,1) proved insufficient for this data set. Therefore, as we continued forward with the
project, we focused only on ARI(1,1) and IMA(1,1).
Finally, we forecasted values for our testing data set using both ARI(1,1) and IMA(1,1). In doing
so, we calculated the MSE, MAP, and PMAD for each of the models. We found that the
IMA(1,1) model had the smallest numerical value in all three of these tests. This meant that for
our data set, IMA(1,1) was the best model. However, note that the values of these criteria are not
as small as we would have wished; the smaller the value, the better the predictions. As seen in
the final time series plot in Figure 14, our predictions are far from perfect. In all cases, the
predictions overestimate the actual values, and in some cases, for example 2012, the
overestimation is drastic.
In order to further improve our models, we may need to try other time series models than those
that were discussed in class. In addition, to better predict tornados in Illinois we may have
wanted to break down our data set into quarters of the year. Clearly tornados are more frequent
in the spring and summer months. Using a different division of time would have given us a larger
number of data points from which to build different time series models. Looking at the data set we
did use for this project, the large variance over time in the tornado count could be due to the
number of people who are actually out in Illinois counting
them. In the early years of this data set, tornado counts may be skewed down as people may not
have been tracking them as much as we do in 2015. In addition, the number of tornados
increasing over time could be due to global warming or environmental effects.
In the end, this analysis shows that the model chosen to represent this data was relevant, but
could have been better. As stated before, the predictions were continually overestimated. In the
future, we would like to go back and test other potential models that were not discussed in this
course in order to better predict the annual tornado count in Illinois.
Appendix
A Reference for Model Building
A.1 Training Data Transformation Codes
* All code used for this project is appended at the very end of this paper.
A.2 Model Selection
Dickey Fuller Test on diff(log(t.data))
Model Estimations
A.2.1 Residual Analysis Codes
Normality Test:
H0: data is normal
Ha: data is not normal
Since the p-value in all three cases is > .05, we fail to reject H0.
Independence Test:
H0: data is independent
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.

Ljung-Box Test:
H0: data is independent (r_1 = r_2 = ... = r_k = 0)
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
A.3 Model Validation
Overfitting Models
A.4 Forecasting
Code:
### Illinois Tornado Annual Count Updated (Left out 6 data points)
library(TSA)
library(fUnitRoots)
data1<-read.csv(file="IL Total Data.csv",header=FALSE,sep=",")
x1<-data1[,2]
y1<-data1[,1]
y_train<-y1[1:60]
y_test<-y1[61:66]
t.data<-ts(y_train,freq=1,start=c(1950,1))
t.data1<-ts(y_test,freq=1,start=c(2010,1))
k.data<-ts(y1,freq=1,start=c(1950,1))
#Original Time Series Plot
plot(t.data,ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
plot(y=t.data,x=zlag(t.data),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # of IL Tornados vs Last Years # of IL Tornados')
#log Transform
plot(log(t.data),ylab='log(Annual Tornados in IL)',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
acf(log(t.data))
pacf(log(t.data))
eacf(log(t.data))
# eacf(log(t.data)) output:
# AR/MA
#   0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 0 x x o o o o o o o o o o o o
# 1 x o o o o o o o o o o o o o
# 2 o o o o o o o o o o o o o o
# 3 x o o o o o o o o o o o o o
# 4 o o o o o o o o o o o o o o
# 5 x o o o o o o o o o o o o o
# 6 x o o o o o o o o o o o o o
# 7 o o o o o o o o o o o o o o
#First Difference Log
plot(diff(log(t.data)),ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL Diff(log(t.data))',type='o')
#Test for Stationarity
adfTest(diff(log(t.data)),lags=1,type=c("nc"))
adfTest(diff(log(t.data)),lags=1,type = c("c"))
adfTest(diff(log(t.data)),lags = 1, type = c("ct"))
#Model Building
acf(diff(log(t.data))) #Suggests MA(1)
pacf(diff(log(t.data))) #Suggests AR(1)
eacf(diff(log(t.data))) # suggests AR(1)
# eacf(diff(log(t.data))) output:
# AR/MA
#   0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 0 x o o o o o o o o o o o o o
# 1 o o o o o o o o o o o o o o
# 2 x o o o o o o o o o o o o o
# 3 x o o o o o o o o o o o o o
# 4 x o o o o o o o o o o o o o
# 5 x x o o o o o o o o o o o o
# 6 x o o o o o o o o o o o o o
# 7 o o x x x o o o o o o o o o
#Best Subset suggests MA(1) as best, then ARMA(1,1)
sub1<-armasubsets(diff(log(t.data)),nar=4,nma=4,y.name='test',
ar.method='ols')
plot(sub1)
#Scatter PLot Comparison
par(mfrow = c(1, 3),pty = "s")
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data))),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=2),ylab='Tornado Count this Year',xlab='2 Years ago Tornado Count',main='Scatterplot of # of IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=3),ylab='Tornado Count this Year',xlab='3 Years ago Tornado Count',main='Scatterplot of # of IL Tornados')
abline(0,0)
#Fitting Models
AR1<-arima(log(t.data), order = c(1, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
MA1<-arima(log(t.data), order = c(0, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
ARMA11<-arima(log(t.data), order = c(1, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL,
method = 'ML')
#No outliers were detected.
detectAO(ARMA11)
detectIO(ARMA11)
#Residual Analysis
tsdiag(AR1,gof=4,omit.initial=F)
tsdiag(MA1,gof=4,omit.initial=F)
tsdiag(ARMA11,gof=4,omit.initial=F)
#Normality
op <- par(mfrow = c(1, 3),pty = "s")
qqnorm(residuals(AR1),main='ARI(1,1) QQ Plot')
qqline(residuals(AR1),col='red')
qqnorm(residuals(MA1),main='IMA(1,1) QQ Plot')
qqline(residuals(MA1),col='red')
qqnorm(residuals(ARMA11),main='ARIMA(1,1,1) QQ Plot')
qqline(residuals(ARMA11),col='red')
#Formal Testing
ks.test(residuals(AR1),"pnorm")
shapiro.test(residuals(AR1))
ks.test(residuals(MA1),"pnorm")
shapiro.test(residuals(MA1))
ks.test(residuals(ARMA11),"pnorm")
shapiro.test(residuals(ARMA11))
#Constant Variance
op <- par(mfrow = c(1, 3),pty = "s")
plot(rstandard(AR1),ylab='Standardized residuals',main='ARI(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(MA1),ylab='Standardized residuals',main='IMA(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(ARMA11),ylab='Standardized residuals',main='ARIMA(1,1,1)',type='o')
abline(0,0,col="red",lwd=2)
#Independence
#ACF Plot
op <- par(mfrow = c(1, 3),pty = "s")
acf(residuals(AR1),main='Sample ACF of Residuals from ARI(1,1) Model')
acf(residuals(MA1),main='Sample ACF of Residuals from IMA(1,1) Model')
acf(residuals(ARMA11),main='Sample ACF of Residuals from ARIMA(1,1,1) Model')
#Ljung-Box
Box.test(residuals(AR1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(MA1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(ARMA11),lag=4, type="Ljung-Box",fitdf=2)
# Runs
runs(residuals(AR1))
runs(residuals(MA1))
runs(residuals(ARMA11))
#Over fitting Parameter Redundancy
AR2<-arima(log(t.data), order = c(2, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
MA2<-arima(log(t.data), order = c(0, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
ARMA12<-arima(log(t.data), order = c(1, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL,
method = 'ML')
ARMA21<-arima(log(t.data), order = c(2, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL,
method = 'ML')
#Predictions/Forecasting
#ARI(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {y_train<-y1[1:(61+i-1)]
est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
pred1[i]<-predict(est_1,n.ahead=1)$pred}
t.pred1<-ts(pred1,freq=1,start=c(2010,1))
t.pred1
# Time Series:
# Start = 2010
# End = 2015
# Frequency = 1
# [1] 3.926734 4.112738 3.839367 3.759854 3.944391 4.077988
log(y_test)
# [1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with ARI(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))
#IMA(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {y_train<-y1[1:(61+i-1)]
est_1<-arima(log(y_train),order=c(0,1,1),method='ML')
pred1[i]<-predict(est_1,n.ahead=1)$pred}
pred4<-rep(NA,6)
for(i in 1:6) {y_train<-y1[1:(61+i-1)]
est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
pred4[i]<-predict(est_1,n.ahead=1)$se}
t.pred2<-ts(pred1,freq=1,start=c(2010,1))
t.pred2
# Time Series:
# Start = 2010
# End = 2015
# Frequency = 1
# [1] 3.884078 4.067417 3.799722 3.891222 3.891483 4.041335
log(y_test)
# [1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with IMA(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))
#Based off of our predictions for ARI(1,1), IMA(1,1); IMA(1,1) was the best model.
#Prediction Intervals for IMA(1,1)
lower<-pred1-qnorm(0.975,0,1)*pred4
upper<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Year=c(2010:2015),lower,upper)
#Transform Model Back
y_test
# [1] 49 73 32 55 49 69
kk<-exp(pred1 + (1/2)*(pred4)^2)
kk
# [1] 61.39905 73.55173 56.25562 61.43353 61.24097 70.94283
#100(1-alpha)% prediction intervals
# Create lower and upper prediction interval bounds
lower1<-pred1-qnorm(0.975,0,1)*pred4
upper1<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Years=c(1:66),lower1,upper1)
#Original 95% Prediction Intervals
data.frame(Years=c(1:66),exp(lower1),exp(upper1))
Years exp.lower1. exp.upper1.
1 1 12.74597 185.4787
2 2 15.43213 221.0484
3 3 11.82101 168.9437
4 4 13.08390 183.2885
5 5 13.21798 181.5241
6 6 15.48158 209.1434
7 7 12.74597 185.4787
8 8 15.43213 221.0484
9 9 11.82101 168.9437
10 10 13.08390 183.2885
11 11 13.21798 181.5241
12 12 15.48158 209.1434
13 13 12.74597 185.4787
14 14 15.43213 221.0484
15 15 11.82101 168.9437
16 16 13.08390 183.2885
17 17 13.21798 181.5241
18 18 15.48158 209.1434
19 19 12.74597 185.4787
20 20 15.43213 221.0484
21 21 11.82101 168.9437
22 22 13.08390 183.2885
23 23 13.21798 181.5241
24 24 15.48158 209.1434
25 25 12.74597 185.4787
26 26 15.43213 221.0484
27 27 11.82101 168.9437
28 28 13.08390 183.2885
29 29 13.21798 181.5241
30 30 15.48158 209.1434
31 31 12.74597 185.4787
32 32 15.43213 221.0484
33 33 11.82101 168.9437
34 34 13.08390 183.2885
35 35 13.21798 181.5241
36 36 15.48158 209.1434
37 37 12.74597 185.4787
38 38 15.43213 221.0484
39 39 11.82101 168.9437
40 40 13.08390 183.2885
41 41 13.21798 181.5241
42 42 15.48158 209.1434
43 43 12.74597 185.4787
44 44 15.43213 221.0484
45 45 11.82101 168.9437
46 46 13.08390 183.2885
47 47 13.21798 181.5241
48 48 15.48158 209.1434
49 49 12.74597 185.4787
50 50 15.43213 221.0484
51 51 11.82101 168.9437
52 52 13.08390 183.2885
53 53 13.21798 181.5241
54 54 15.48158 209.1434
55 55 12.74597 185.4787
56 56 15.43213 221.0484
57 57 11.82101 168.9437
58 58 13.08390 183.2885
59 59 13.21798 181.5241
60 60 15.48158 209.1434
61 61 12.74597 185.4787
62 62 15.43213 221.0484
63 63 11.82101 168.9437
64 64 13.08390 183.2885
65 65 13.21798 181.5241
66 66 15.48158 209.1434
#Convert back to Original TS PLOT IMA(1,1)
plot(y1,ylab='Annual Tornados in IL',xlab='1950 - 2015',main='Time Series Plot of Annual Tornados in IL',type='o')
points(ts(kk,start=c(61),frequency=1),col="red",type='o')
References
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)

More Related Content

Similar to NEW Time Series Paper

Forecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptxForecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptx
MOINDALVS
 
Running head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docxRunning head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docx
SUBHI7
 
vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014Thibault Vatter
 
Storm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SASStorm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SAS
Gautam Sawant
 
Final Time series analysis part 2. pptx
Final Time series analysis part 2.  pptxFinal Time series analysis part 2.  pptx
Final Time series analysis part 2. pptx
SHUBHAMMBA3
 
FA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdfFA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdf
SOUMYASHARMA909224
 
Module 3 - Time Series.pptx
Module 3 - Time Series.pptxModule 3 - Time Series.pptx
Module 3 - Time Series.pptx
nikshaikh786
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
Arun Kejariwal
 
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Bayu imadul Bilad
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Alexander Decker
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Alexander Decker
 
Time series
Time seriesTime series
Dinosaur Extinction Essay
Dinosaur Extinction EssayDinosaur Extinction Essay
Dinosaur Extinction Essay
Kristi Anderson
 
Applied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).pptApplied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).ppt
swamyvivekp
 
Analysis of Taylor Rule Deviations
Analysis of Taylor Rule DeviationsAnalysis of Taylor Rule Deviations
Analysis of Taylor Rule DeviationsCheng-Che Hsu
 
Report stella simulator
Report stella simulatorReport stella simulator
Report stella simulator
syamimiauni18
 
Option pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatilityOption pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatility
FGV Brazil
 
How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...
Petter Holme
 

Similar to NEW Time Series Paper (20)

Forecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptxForecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptx
 
Running head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docxRunning head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docx
 
vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014
 
Storm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SASStorm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SAS
 
Final Time series analysis part 2. pptx
Final Time series analysis part 2.  pptxFinal Time series analysis part 2.  pptx
Final Time series analysis part 2. pptx
 
FA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdfFA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdf
 
Module 3 - Time Series.pptx
Module 3 - Time Series.pptxModule 3 - Time Series.pptx
Module 3 - Time Series.pptx
 
Event_studies
Event_studiesEvent_studies
Event_studies
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...
 
Time series
Time seriesTime series
Time series
 
Dinosaur Extinction Essay
Dinosaur Extinction EssayDinosaur Extinction Essay
Dinosaur Extinction Essay
 
Applied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).pptApplied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).ppt
 
Analysis of Taylor Rule Deviations
Analysis of Taylor Rule DeviationsAnalysis of Taylor Rule Deviations
Analysis of Taylor Rule Deviations
 
Report stella simulator
Report stella simulatorReport stella simulator
Report stella simulator
 
Forecasting techniques
Forecasting techniquesForecasting techniques
Forecasting techniques
 
Option pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatilityOption pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatility
 
How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...
 

NEW Time Series Paper

  • 1. 1 Annual IL Tornado Count Katie Ruben April 22, 2016 As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologist’s look for four ingredients in predicting such severe weather. These ingredients are when the “temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur[1].” Tornado’s are an important aspect of living in certain areas of the country. They can cause death, injury, property damage, and also high anxiety in many people who choose to live in areas prone to Tornados. In particular, my project will deal with looking at the number of annual tornados that have occurred in Illinois since 1950. Meteorologists are interested in improving their understanding of the causes of tornados as well as when they are to occur. The data used during this simulation comes from the National Oceanic and Atmospheric Administration [2]. The data I am looking at contains a tornado count from 1950 to 2015 for every state in the United States. I choose to look strictly at Illinois in this date range. When doing so, I ended up with 2406 tornados over those 65 years. In particular, I am interested in forecasting the number of tornados that will occur in subsequent years based on the time series data I have found. In order to analyze the Illinois Tornado Count times series data I will first look to see if the data is stationary or non-stationary. I will apply the Dickey Fuller test to determine this. Depending on this outcome, it will help me determine which set of time series models I will want to continue with. Depending on my original data set, I will want to perform transformations that reduce large variance as well as take care of any explosive behavior in the data. Upon doing so, I will then be able to generate preliminary models using ACF, PACF, EACF, and the ARMA subset. Once several potential models have been chosen, I will fit these models by estimating parameters using the maximum likelihood method. In addition, perform a residual analysis on my fitted models and make sure, to the best of my ability, that the models are from the normal distribution, are independent, and have constant variance. In order to achieve this, I will look at the KS Test, SW Test, and QQ-plot for normality, runs test and sample autocorrelation function for independence, and finally the BP Test for constant variance. I will continue building an appropriate model by looking for outliers and adjusting my models based on residual analysis. The final step is to perform a forecasting of my data set into the future. I will compare my forecast with my actual data set to see how accurate my model has become. [1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from http://www.spc.noaa.gov/faq/tornado/ [2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)
  • 2. 2 1   Background Tornado’s are an important aspect of living in certain areas of the country. They can cause death, injury, property damage, and also high anxiety in many people who choose to live in areas prone to Tornados. In particular, this project will deal with looking at the number of annual tornados that have occurred in Illinois since 1950. Meteorologists are interested in improving their understanding of the causes of tornados as well as when they are to occur. The data used during this simulation comes from the National Oceanic and Atmospheric Administration [2]. The data being investigated contains a tornado count from 1950 to 2015 for every state in the United States. I choose to look strictly at Illinois in this date range. When doing so, I ended up with 2406 tornados over those 66 years. In particular, I am interested in forecasting the number of tornados that will occur in subsequent years based on the time series data found. As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologist’s look for four ingredients in predicting such severe weather. These ingredients are when the “temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur [1].” Prior to starting this time series data analysis, we will split our 66 year observations into two sets; training data and validation data. The training data set will contain 60 years (1950-2009), while the validation data set will contain 6 years (2010-2015). The validation set contains 9% of the total observed years. Keep in mind that over these 66 years there were 2,406 tornado sightings in Illinois. In this paper, we begin by using our time series data set to perform preliminary transformations on the training set to ensure stationarity. If the data set shows non-stationary behavior, I will go through several different transformations in section 2 of this paper. Section 2 will also contain the model identification process for several time series models as well as estimation and residual analysis. Since the training data contains 56 observations, the ideal maximum lag recommended by the autocorrelation of residuals is 𝑘 = ln(56) ≈ 4. This will become important as we work through this data set. In section 3, we will focus on model validation and choosing which of our models is most accurate as well forecasting. In section 4 we will focus on a discussion of our results from this project on IL tornado counts between 1950-2015.
  • 3. 3 2   Training Data Transformations 2.1.1 Training Data To begin our model building process, we start by examining the training data set. The time series plot of our Illinois Annual Tornado Count is shown in figure 1. This plot suggests that there is extremely large variance as well as an explosive behavior being demonstrated as time passes. This suggests that our time series data set is non-stationary. However, we need to conduct some formal testing. Figure 1: Training Data Time Series & Scatter Plot The Dickey Fuller Test is used in order to determine if the data set is stationary or non- stationary. The null hypotheses states that 𝛼 = 1 then there is a unit root and the time series in non-stationary. The alternative hypotheses states that 𝛼 < 1 then the time series is stationary. If the time series is non-stationary, then it is suggested to take the difference. Throughout this paper, we will be concerned with a significance level of .05. Time Series Plot of Annual Tornados in IL Time AnnualTornadosinIL 1950 1960 1970 1980 1990 2000 2010 020406080100120 0 20 40 60 80 100 120 020406080100120 Scatterplot of # of IL Torndaos vs Last Years # of IL Tornados Previous Year Tornado Count TornadoCountthisYear
  • 4. 4 2.1.2   Transformed Training Data 2.1.2.1 Log(Training Data) Before, we check for stationarity, we need to try to eliminate the large variance seen in our data. To do so, we take the natural logarithmic transformation of the training time series data. The plot is shown in Figure 2. As seen if Figure 2, there still exists large variance, as well as an explosive behavior. It is suggested now to look at a Box Cox Transformation of the logarithm transformation to see if we can come up with a better representation for our data set. However, when using the Box Cox transformation in R, we get that 𝜆 = 0. This suggests that no transformation is needed. Figure 2: Natural Logarithm Transformation Training Data Plot 2.1.2.3  Difference on Logarithm of Preliminary Data The final transformation used to attempt to remove the explosive behavior is to difference the training data. As seen in Figure 3, the time series plot with this transformation looks much better. The explosive behavior has dissipated. Looking at the scatter plot of this transformed data in Figure 4, we see that 𝑌1   𝑣 𝑠. 𝑌167  shows a negative correlation, 𝑌1   𝑣 𝑠. 𝑌168 shows either a slight negative correlation or no correlation and 𝑌1   𝑣 𝑠. 𝑌169 shows no correlation. Investigation of these plots suggests that we may have a time series model of order 1. We will conduct formal model selections next. Time Series Plot of Annual Tornados in IL Time log(AnnualTornadosinIL) 1950 1960 1970 1980 1990 2000 2010 1.52.02.53.03.54.04.5
  • 5. 5 Figure 3: Difference Training Data Time Series Plots Figure 4: Difference Training Data Scatter Plots There no longer seems to be an apparent explosive behavior in the times series plot when taking the difference log transform. This suggests stationarity in our transformed training data. However, a formal Dickey Fuller Test must be applied. In doing so, we get a 𝑝 − 𝑣𝑎𝑙𝑢𝑒 = .01 < .05 = 𝛼 for lag = 1,2 for non-constant, constant, and linear trends. Therefore, since the p-value is less than the significance level we reject the null hypothesis and our model is suggested to be stationary. A sample R output is shown in Appendix A. Time Series Plot of Annual Tornados in IL Diff(log(t.data)) Time AnnualTornadosinIL 1950 1960 1970 1980 1990 2000 2010 -1.5-1.0-0.50.00.51.01.5 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5-1.0-0.50.00.51.01.5 Scatterplot of # IL Tornados Previous Year Tornado Count TornadoCountthisYear -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5-1.0-0.50.00.51.01.5 Scatterplot of # of IL Torndaos 2 Years ago Tornado Count TornadoCountthisYear -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5-1.0-0.50.00.51.01.5 Scatterplot of # of IL Torndaos 3 Years ago Tornado Count TornadoCountthisYear
  • 6. 6 2.1.3 Preliminary Model Building Now that we have figured out how to eliminate the large explosive behavior in our data set, we can begin to look at finding preliminary models to build an appropriate times series model from. In order to do this, we will look at the ACF plot for potential Moving Average models, PACF plot for potential Autoregressive models, EACF chart for potential Mixed Process models, and ARMA subset selections. Figure 5 shows the ACF, PACF and EACF plots for our difference model chosen above. Figure 5: ACF, PACF, & EACF The ACF plot suggests that with our data set, a Moving Average of order 1 may be a potential model. The PACF plot suggests that a Autoregressive of order 1 may be a potential model. One aspect to keep in mind is that the PACF should show lags that exponentially decay theoretically. The PACF plot for this data set does not follow this exponential decaying pattern. Therefore, an Autoregressive model may not be the best suited model. The EACF plot suggests again an AR(1) We can also determine the best potential model by looking at the ARMA subset based on BIC or AIC values. This output is displayed in Figure 6. The maximum number of lags allowed is 𝑘 = ln(61) ≈ 4 based on the autocorrelation of residuals recommendation from literature on the topic. This output suggests that the best model for my data would be MA(1) with an intercept term. This is the same suggestion made by the ACF plot. The second best suggestion would be an ARMA(1,1) process. Throughout the rest of this project, I will work with the following processes; ARI(1,1), IMA(1,1), and ARIMA(1,1,1). 5 10 15 -0.4-0.3-0.2-0.10.00.10.2 Series diff(log(t.data)) Lag ACF 5 10 15 -0.4-0.3-0.2-0.10.00.10.2 Lag PartialACF Series diff(log(t.data))
  • 7. 7 Figure 6: ARMA Subset BIC 2.1.3.1 Estimations Using maximum likelihood estimates, we were able to come up with suitable models for ARI(1,1), IMA(1,1), and ARIMA(1,1,1). When working with time series models, it is best to always choose a simple model to explain your data. Figure 7 displays the following estimates for each of these model. Note that the intercept terms were not significant enough at a level of .05, so they don’t need to be included in the models. We determined the significance by looking at the output in R for the estimations and found that when looking at the ratio of the intercept coefficient to the standard error of that coefficient, the value was close to zero when compared to the critical value of 1.96. This indicated that the intercept of the coefficient was not significantly different from zero since the ratio of estimation per standard error was less than 1.96. Note that my models were estimated using the log data. ARI(1,1): 𝑌1 = −0.4447 𝑌167 + 𝑒1 IMA(1,1):   𝑌1 = 𝑒1 − (−.5491) 𝑒167 ARIMA(1,1,1): 𝑌1 = .3370 𝑌167 + 𝑒1 − (−.8658)(𝑒167) Figure 7: Model Estimates 2.1.3.1.1 Outliers Before proceeding further, we must determine if there exist outliers for each of our potential models. In R, we ran the additive outlier and innovational outlier commands. Both commands in R, (AO and IO detect), for each model confirmed that there did not exist an outlier in any of the three models. Therefore, we can continue with our residual analysis. BIC (Intercept) test-lag1 test-lag2 test-lag3 test-lag4 error-lag1 error-lag2 error-lag3 error-lag4 17 14 10 6.9 3.4 0.45 -2.1 -4.7
8
2.1.3.2 Residual Analysis
The next step is to examine the residuals of the three models. From the residuals, we can assess normality, constant error variance, and independence. Recall that the original training data showed large variance; the transformations took care of stationarity, but going in we expect that the residuals may still show substantial variance, and possibly departures from normality and independence as well. We will therefore conduct formal tests of each of these characteristics for all three models.
As seen in Figure 8, the QQ plots do not suggest strong normality. All three models show somewhat heavy tails, and the QQ reference line does not track the data points as closely as we would like. In our opinion, the ARI(1,1) model has the best-looking QQ plot for normality. To verify this, we conduct a Kolmogorov-Smirnov (KS) test and a Shapiro-Wilk test, reported in Appendix A. At a significance level of .05, we fail to reject the null hypothesis of normality in every KS and Shapiro-Wilk test for all three models; in each case the p-value exceeds the significance level. This means we may treat the residuals as coming from a normal distribution.
Figure 8: QQ Plots of Models
Next we look at constant error variance, shown in Figure 9. In all three plots there is considerable spread around the horizontal line y = 0, but each model's standardized residuals resemble white noise, so we can tentatively assume constant variance for each model. I was unable to perform a Breusch-Pagan or Brown-Forsythe test on these residuals because there is no x-variable to regress them on, so R could not produce those tests.
[Figure 8: normal QQ plots of the residuals from the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) fits.]
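A minimal sketch of the normality and variance diagnostics, assuming the TSA package is loaded and the fitted objects AR1, MA1, and ARMA11 from the estimation step:

# QQ plots of the residuals for each fitted model (Figure 8)
op <- par(mfrow = c(1, 3), pty = "s")
qqnorm(residuals(AR1), main = 'ARI(1,1) QQ Plot');      qqline(residuals(AR1), col = 'red')
qqnorm(residuals(MA1), main = 'IMA(1,1) QQ Plot');      qqline(residuals(MA1), col = 'red')
qqnorm(residuals(ARMA11), main = 'ARIMA(1,1,1) QQ Plot'); qqline(residuals(ARMA11), col = 'red')
par(op)

# Formal normality tests (repeated for each model in Appendix A)
ks.test(residuals(MA1), "pnorm")   # Kolmogorov-Smirnov against a standard normal
shapiro.test(residuals(MA1))       # Shapiro-Wilk

# Standardized residuals over time as a visual check for constant variance (Figure 9)
plot(rstandard(MA1), ylab = 'Standardized residuals', type = 'o')
abline(h = 0, col = 'red')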
9
Figure 9: Error Variance Analysis of Models
[Figure 9: standardized residuals over time for the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) fits.]
Finally, we check whether the residuals of the three models are independent. To do this, we first use a runs test for each model. Based on the runs tests in Appendix A, we fail to reject the null hypothesis that the residuals are independent, because each p-value is greater than the significance level of .05; we therefore treat the residuals as independent. Another way to test for independence is the Ljung-Box test, whose null hypothesis is that the data are independently distributed; in other words, it tests whether the residual autocorrelations are jointly different from zero. Based on the results in Appendix A, we again fail to reject the null hypothesis, since the p-value exceeds the significance level for each of the three models. As a final confirmation, we examine the ACF plot of the residuals for each model in Figure 10. Since the sample autocorrelations all fall within the blue cutoff lines, the residuals resemble white noise and can be treated as independent.
Figure 10: ACF Residuals
[Figure 10: sample ACF of the residuals from the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) models.]
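A minimal sketch of the independence checks, again assuming the TSA package and the fitted objects from the estimation step:

# Runs test on the residuals (TSA::runs); repeated for AR1 and ARMA11 in Appendix A
runs(residuals(MA1))

# Ljung-Box test; fitdf is the number of estimated ARMA parameters in the model
Box.test(residuals(MA1), lag = 4, type = "Ljung-Box", fitdf = 1)

# Sample ACF of the residuals (Figure 10); bars inside the bounds suggest white noise
acf(residuals(MA1), main = 'Sample ACF of Residuals from IMA(1,1) Model')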
10
Therefore, we have been able to transform the training data into a series whose model residuals show normality, constant variance, and independence. This is often not an easy task, but with a data set of only 66 annual values it was manageable.
3 Model Validation
3.1.1 Confirmation of Models (Overfitting & Parameter Redundancy)
We now confirm that the three suggested models are good models for the data set by extending each with one additional parameter. If the estimate of the additional parameter is not significantly different from zero, and the estimates of the original parameters do not change much from their original values, then we can confirm that the model is a good fit. We use a significance level of .05: if the absolute value of the ratio of an estimated coefficient to its standard error is less than the critical value of 1.96, we treat that coefficient as not significantly different from zero. A sketch of the overfitting fits follows the comparison tables below.

Model          ARI(1,1)            ARI(2,1)
φ1 (s.e.)      -0.4447 (0.1155)    -0.4970 (0.1298)
φ2 (s.e.)      --                  -0.1124 (0.1298)
AIC            125.67              126.93
Check on φ2:   -0.1124 / 0.1298 = -0.8659; |-0.8659| < 1.96, therefore not significant.

Model          IMA(1,1)            IMA(1,2)
θ1 (s.e.)      -0.5491 (0.1358)    -0.5527 (0.1352)
θ2 (s.e.)      --                  -0.0475 (0.1728)
AIC            124.04              125.97
Check on θ2:   -0.0475 / 0.1728 = -0.2748; |-0.2748| < 1.96, therefore not significant.

Model          ARIMA(1,1,1)        ARIMA(1,1,2)        ARIMA(2,1,1)
φ1 (s.e.)      0.3370 (0.1698)     -0.9009 (0.2298)    0.3189 (0.1461)
φ2 (s.e.)      --                  --                  0.2174 (0.1388)
θ1 (s.e.)      -0.8658 (0.1027)    0.3474 (0.2712)     -0.9114 (0.0714)
θ2 (s.e.)      --                  -0.4464 (0.2057)    --
AIC            124.97              127.71              124.52
Check on θ2 in ARIMA(1,1,2):  -0.4464 / 0.2057 = -2.1701; |-2.1701| > 1.96, therefore significant.
Check on φ2 in ARIMA(2,1,1):  0.2174 / 0.1388 = 1.566; 1.566 < 1.96, therefore not significant.

Continuing to increase the order of the ARIMA model does not improve the AIC in a useful way; in general, a smaller AIC indicates a better model. Taking the AIC values and these significance checks together, ARIMA(1,1,1) appears to be the best mixed-process candidate.
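A minimal sketch of the overfitting fits, matching the calls in Appendix A; the final two lines are an illustrative way to read off the added coefficient's z-ratio and to compare AIC values.

# Fit each candidate with one extra AR or MA parameter
AR2    <- arima(log(t.data), order = c(2, 1, 0), method = 'ML')   # ARI(2,1)
MA2    <- arima(log(t.data), order = c(0, 1, 2), method = 'ML')   # IMA(1,2)
ARMA12 <- arima(log(t.data), order = c(1, 1, 2), method = 'ML')   # ARIMA(1,1,2)
ARMA21 <- arima(log(t.data), order = c(2, 1, 1), method = 'ML')   # ARIMA(2,1,1)

# z-ratios for the extended model's coefficients, and AIC comparison with the original fit
MA2$coef / sqrt(diag(MA2$var.coef))
c(AIC(MA1), AIC(MA2))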
11
Looking at the three suggested models under the overfitting procedure, ARI(1,1) and IMA(1,1) are confirmed to be good models for this time series. We were unable to confirm ARIMA(1,1,1) as a good model, because the overfitted ARIMA(1,1,2) has a significant coefficient for θ2 and the overfitted estimates were not close to the original estimates for this model. Therefore, we cannot confirm ARIMA(1,1,1) as a good fit for our data, and as we continue with forecasting I will work only with the ARI(1,1) and IMA(1,1) models on the log data.
3.1.2 Forecasting
The final step is to identify which of the remaining models is the better predictor of annual tornados. We forecast values for the testing data set; recall that we initially held out the last 6 years of the data, so we can now test how accurate the models are. As seen in Figure 11, the predictions made in R are displayed as red dots. These predictions were produced with the one-step-ahead forecasting procedure discussed in class; a sketch of the forecast loop and accuracy measures follows below. As you can tell, the fit is not perfect. This is partly because the data set contains only 66 points, and those points may ultimately need a different modeling technique; here, however, we restrict ourselves to the techniques demonstrated in class.
Figure 11: Log Data Forecasts
[Figure 11: one-step-ahead forecasts (red points) of log(t.data) under the ARI(1,1) and IMA(1,1) models.]
To determine which model has better predictive ability, we look at the MSE, MAP, and PMAD; the smaller these values, the better the predictions. From the table below, IMA(1,1) has the best predictive ability by this criterion.
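A minimal sketch of the one-step-ahead forecast loop and the accuracy measures; the window indexing follows the script in Appendix A, y1 is the full annual count series, y_test the 6 held-out years, and pred, fit, and err are illustrative names.

# One-step-ahead forecasts of the 6 test years under IMA(1,1); use order = c(1, 1, 0) for ARI(1,1)
pred <- rep(NA, 6)
for (i in 1:6) {
  y_train <- y1[1:(61 + i - 1)]                 # expand the fitting window as in Appendix A
  fit     <- arima(log(y_train), order = c(0, 1, 1), method = 'ML')
  pred[i] <- predict(fit, n.ahead = 1)$pred     # next-step point forecast on the log scale
}

# Accuracy measures against the held-out log counts
err  <- log(y_test) - pred
MSE  <- mean(err^2)
MAP  <- mean(abs(err / log(y_test)))
PMAD <- sum(abs(err)) / sum(abs(log(y_test)))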
12
           ARI(1,1)      IMA(1,1)
MSE        0.0434644     0.03533288
MAP        0.0450561     0.03748878
PMAD       0.04383421    0.03675122
IMA(1,1) has the smaller value on every measure. As you can see, the values are not as small as we would like, but with the data at hand this seems reasonably good. All code used to produce the predictions is in Appendix A.
Now that we have chosen IMA(1,1) as our best model, we transform the forecasts back to the original scale. Figure 12 shows the predictions made on the log data, while Figure 13 shows the predictions after transforming back to the original count scale; both figures also include the corresponding prediction intervals. A sketch of the back-transformation follows below.
Figure 12: Log Data Prediction Values & Intervals
Figure 13: Original Data Prediction Values & Intervals
Figure 13 shows that three of our 6 values were predicted fairly close to the actual value. However, there is still a lot of variability in the model's predictions, and the prediction intervals for the original data after transforming back are very wide. This means the predictive capability of IMA(1,1) is not especially strong. The complete list of 95% prediction intervals for the original data set is given in Appendix A. Figure 14 displays the final graph of the time series, comparing the original data to the 6 predicted values.
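A minimal sketch of the back-transformation from the log scale to tornado counts, where pred and se are assumed to hold the one-step point forecasts and their standard errors from predict(), as in Appendix A:

# Point forecasts on the count scale using the lognormal mean correction
count_hat <- exp(pred + 0.5 * se^2)

# 95% prediction intervals: form them on the log scale, then exponentiate
lower <- exp(pred - qnorm(0.975) * se)
upper <- exp(pred + qnorm(0.975) * se)
data.frame(Year = 2010:2015, forecast = round(count_hat, 1), lower = round(lower, 1), upper = round(upper, 1))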
13
Figure 14: Time Series Plot with Predictions (Original Data)
[Figure 14: annual tornado counts in IL, 1950-2015, with the 6 predicted values overlaid.]
IMA(1,1):  Y_t = e_t - (-0.5491) e_(t-1)
4 Discussion
The goal of this analysis was to use time series models to predict the future annual count of tornados in Illinois. We started with 2,406 tornado sightings in Illinois from 1950 to 2015 and aggregated them into yearly counts, giving 66 data points. This is a fairly small data set for a time series analysis, which should be kept in mind throughout the discussion of our results.
We began the analysis with a training data set containing the first 60 years, setting aside the last 6 years as a testing data set that became essential when we performed our forecasts. Our first goal was to make the data stationary. To do this, we applied a log transformation and then differenced the data: the log transformation reduced the variability seen in the original series, while differencing removed the explosive behavior seen in the original time series plot. We confirmed stationarity with the Dickey Fuller test.
Once we had a stationary data set, we began the estimation process. We looked at the ACF, PACF, EACF, and best subset selection chart to determine which models would be best.
14
We came to the conclusion that ARI(1,1), IMA(1,1), and ARIMA(1,1,1) would all be suitable models at that point.
Next, we performed a residual analysis of all three models. As discussed in the paper, all three models were shown to have normality, constant error variance, and independence. This is primarily due to the small sample size of our data set; when a data set is small (or extremely large), these characteristics are much easier to achieve. However, when we performed overfitting checks on all three models, ARIMA(1,1,1) was shown not to be adequate for this data set. Therefore, as we continued with the project, we focused only on ARI(1,1) and IMA(1,1).
Finally, we forecasted values for the testing data set using both ARI(1,1) and IMA(1,1) and calculated the MSE, MAP, and PMAD for each model. The IMA(1,1) model had the smallest value on all three criteria, making it the best model for our data set. Note, however, that the values of these criteria are not as small as we would have wished; the smaller the value, the better the predictions. As seen in the final time series plot in Figure 14, our predictions are far from perfect. In every test year, the predictions overestimate the actual values, and in some cases, for example 2012, this overestimation is drastic.
To further improve the models, we may need to try time series models beyond those discussed in class. In addition, to better predict tornados in Illinois, we might have broken the data down into quarters of the year, since tornados are clearly more frequent in the spring and summer months. Using a finer division of time would also have given us more data points from which to build different time series models. Looking at the data set we did use, the large variance over time in the tornado count could be due to the number of people actually out in Illinois counting them: in the early years of the record, counts may be skewed low because tornados were not tracked as intensively as they are in 2015. The apparent increase in tornado counts over time could also reflect global warming or other environmental effects.
In the end, this analysis shows that the chosen model is relevant but could be better. As noted, the predictions were consistently overestimated. In the future, we would like to test other potential models not covered in this course in order to better predict the annual tornado count in Illinois.
15
Appendix A   Reference for Model Building
A.1 Training Data Transformation Codes
* All code used for this project is appended at the very end of this paper.
A.2 Model Selection
Dickey Fuller test on diff(log(t.data))
16
Model Estimations
A.2.1 Residual Analysis Codes
Normality Tests:
H0: data is normal
Ha: data is not normal
Since the p-value in all three cases is > .05, we fail to reject H0.
17
Independence Test:
H0: data is independent
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
Ljung-Box Test:
H0: data is independent (r1 = r2 = ... = rk = 0)
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
20
Code:
### Illinois Tornado Annual Count Updated (Left out 6 data points)
library(TSA)
library(fUnitRoots)
data1<-read.csv(file="IL Total Data.csv",header=FALSE,sep=",")
x1<-data1[,2]
y1<-data1[,1]
y_train<-y1[1:60]
y_test<-y1[61:66]
t.data<-ts(y_train,freq=1,start=c(1950,1))
t.data1<-ts(y_test,freq=1,start=c(2010,1))
k.data<-ts(y1,freq=1,start=c(1950,1))

#Original Time Series Plot
plot(t.data,ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
plot(y=t.data,x=zlag(t.data),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # of IL Torndaos vs Last Years # of IL Tornados')

#log Transform
plot(log(t.data),ylab='log(Annual Tornados in IL)',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
acf(log(t.data))
pacf(log(t.data))
eacf(log(t.data))
AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x o o o o o o o o o  o  o  o
1 x o o o o o o o o o o  o  o  o
2 o o o o o o o o o o o  o  o  o
3 x o o o o o o o o o o  o  o  o
4 o o o o o o o o o o o  o  o  o
5 x o o o o o o o o o o  o  o  o
6 x o o o o o o o o o o  o  o  o
7 o o o o o o o o o o o  o  o  o

#First Difference Log
plot(diff(log(t.data)),ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL Diff(log(t.data))',type='o')

#Test for Stationarity
adfTest(diff(log(t.data)),lags=1,type=c("nc"))
adfTest(diff(log(t.data)),lags=1,type=c("c"))
adfTest(diff(log(t.data)),lags=1,type=c("ct"))
21
#Model Building
acf(diff(log(t.data)))   #Suggests MA(1)
pacf(diff(log(t.data)))  #Suggests AR(1)
eacf(diff(log(t.data)))  #Suggests AR(1)
AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x o o o o o o o o o o  o  o  o
1 o o o o o o o o o o o  o  o  o
2 x o o o o o o o o o o  o  o  o
3 x o o o o o o o o o o  o  o  o
4 x o o o o o o o o o o  o  o  o
5 x x o o o o o o o o o  o  o  o
6 x o o o o o o o o o o  o  o  o
7 o o x x x o o o o o o  o  o  o

#Best Subset suggests MA(1) as best, then ARMA(1,1)
sub1<-armasubsets(diff(log(t.data)),nar=4,nma=4,y.name='test', ar.method='ols')
plot(sub1)

#Scatter Plot Comparison
par(mfrow = c(1, 3),pty = "s")
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data))),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=2),ylab='Tornado Count this Year',xlab='2 Years ago Tornado Count',main='Scatterplot of # of IL Torndaos')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=3),ylab='Tornado Count this Year',xlab='3 Years ago Tornado Count',main='Scatterplot of # of IL Torndaos')
abline(0,0)

#Fitting Models
AR1<-arima(log(t.data), order = c(1, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
MA1<-arima(log(t.data), order = c(0, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
ARMA11<-arima(log(t.data), order = c(1, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')

#No outliers were detected.
detectAO(ARMA11)
detectIO(ARMA11)

#Residual Analysis
tsdiag(AR1,gof=4,omit.initial=F)
tsdiag(MA1,gof=4,omit.initial=F)
tsdiag(ARMA11,gof=4,omit.initial=F)

#Normality
op <- par(mfrow = c(1, 3),pty = "s")
qqnorm(residuals(AR1),main='ARI(1,1) QQ Plot')
qqline(residuals(AR1),col='red')
qqnorm(residuals(MA1),main='IMA(1,1) QQ Plot')
qqline(residuals(MA1),col='red')
qqnorm(residuals(ARMA11),main='ARIMA(1,1,1) QQ Plot')
qqline(residuals(ARMA11),col='red')
22
#Formal Testing
ks.test(residuals(AR1),"pnorm")
shapiro.test(residuals(AR1))
ks.test(residuals(MA1),"pnorm")
shapiro.test(residuals(MA1))
ks.test(residuals(ARMA11),"pnorm")
shapiro.test(residuals(ARMA11))

#Constant Variance
op <- par(mfrow = c(1, 3),pty = "s")
plot(rstandard(AR1),ylab='Standardized residuals',main='ARI(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(MA1),ylab='Standardized residuals',main='IMA(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(ARMA11),ylab='Standardized residuals',main='ARIMA(1,1,1)',type='o')
abline(0,0,col="red",lwd=2)

#Independence
#ACF Plot
op <- par(mfrow = c(1, 3),pty = "s")
acf(residuals(AR1),main='Sample ACF of Residuals from ARI(1,1) Model')
acf(residuals(MA1),main='Sample ACF of Residuals from IMA(1,1) Model')
acf(residuals(ARMA11),main='Sample ACF of Residuals from ARIMA(1,1,1) Model')

#Ljung-Box
Box.test(residuals(AR1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(MA1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(ARMA11),lag=4, type="Ljung-Box",fitdf=2)

#Runs
runs(residuals(AR1))
runs(residuals(MA1))
runs(residuals(ARMA11))

#Over fitting / Parameter Redundancy
AR2<-arima(log(t.data), order = c(2, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
MA2<-arima(log(t.data), order = c(0, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
ARMA12<-arima(log(t.data), order = c(1, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
ARMA21<-arima(log(t.data), order = c(2, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')

#Predictions/Forecasting
#ARI(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {
  y_train<-y1[1:(61+i-1)]
  est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
  pred1[i]<-predict(est_1,n.ahead=1)$pred
}
t.pred1<-ts(pred1,freq=1,start=c(2010,1))
t.pred1
Time Series:
Start = 2010
End = 2015
Frequency = 1
[1] 3.926734 4.112738 3.839367 3.759854 3.944391 4.077988
23
log(y_test)
[1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with ARI(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))

#IMA(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {
  y_train<-y1[1:(61+i-1)]
  est_1<-arima(log(y_train),order=c(0,1,1),method='ML')
  pred1[i]<-predict(est_1,n.ahead=1)$pred
}
#One-step forecast standard errors (taken from an ARI(1,1) refit)
pred4<-rep(NA,6)
for(i in 1:6) {
  y_train<-y1[1:(61+i-1)]
  est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
  pred4[i]<-predict(est_1,n.ahead=1)$se
}
t.pred2<-ts(pred1,freq=1,start=c(2010,1))
t.pred2
Time Series:
Start = 2010
End = 2015
Frequency = 1
[1] 3.884078 4.067417 3.799722 3.891222 3.891483 4.041335
log(y_test)
[1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with IMA(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))

#Based off of our predictions for ARI(1,1) and IMA(1,1); IMA(1,1) was the best model.
#Prediction Intervals for IMA(1,1)
lower<-pred1-qnorm(0.975,0,1)*pred4
upper<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Year=c(2010:2015),lower,upper)

#Transform Model Back
y_test
[1] 49 73 32 55 49 69
kk<-exp(pred1 + (1/2)*(pred4)^2)
[1] 61.39905 73.55173 56.25562 61.43353 61.24097 70.94283
24
#100(1-alpha)% prediction intervals
#Create lower and upper prediction interval bounds
lower1<-pred1-qnorm(0.975,0,1)*pred4
upper1<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Years=c(1:66),lower1,upper1)

#Original 95% Prediction Intervals
data.frame(Years=c(1:66),exp(lower1),exp(upper1))
   Years exp.lower1. exp.upper1.
1      1    12.74597    185.4787
2      2    15.43213    221.0484
3      3    11.82101    168.9437
4      4    13.08390    183.2885
5      5    13.21798    181.5241
6      6    15.48158    209.1434
7      7    12.74597    185.4787
8      8    15.43213    221.0484
9      9    11.82101    168.9437
10    10    13.08390    183.2885
11    11    13.21798    181.5241
12    12    15.48158    209.1434
13    13    12.74597    185.4787
14    14    15.43213    221.0484
15    15    11.82101    168.9437
16    16    13.08390    183.2885
17    17    13.21798    181.5241
18    18    15.48158    209.1434
19    19    12.74597    185.4787
20    20    15.43213    221.0484
21    21    11.82101    168.9437
22    22    13.08390    183.2885
23    23    13.21798    181.5241
24    24    15.48158    209.1434
25    25    12.74597    185.4787
26    26    15.43213    221.0484
27    27    11.82101    168.9437
28    28    13.08390    183.2885
29    29    13.21798    181.5241
30    30    15.48158    209.1434
31    31    12.74597    185.4787
32    32    15.43213    221.0484
33    33    11.82101    168.9437
34    34    13.08390    183.2885
35    35    13.21798    181.5241
36    36    15.48158    209.1434
37    37    12.74597    185.4787
38    38    15.43213    221.0484
39    39    11.82101    168.9437
40    40    13.08390    183.2885
41    41    13.21798    181.5241
42    42    15.48158    209.1434
43    43    12.74597    185.4787
44    44    15.43213    221.0484
45    45    11.82101    168.9437
46    46    13.08390    183.2885
47    47    13.21798    181.5241
48    48    15.48158    209.1434
49    49    12.74597    185.4787
50    50    15.43213    221.0484
51    51    11.82101    168.9437
52    52    13.08390    183.2885
25
53    53    13.21798    181.5241
54    54    15.48158    209.1434
55    55    12.74597    185.4787
56    56    15.43213    221.0484
57    57    11.82101    168.9437
58    58    13.08390    183.2885
59    59    13.21798    181.5241
60    60    15.48158    209.1434
61    61    12.74597    185.4787
62    62    15.43213    221.0484
63    63    11.82101    168.9437
64    64    13.08390    183.2885
65    65    13.21798    181.5241
66    66    15.48158    209.1434

#Convert back to Original TS Plot, IMA(1,1)
plot(y1,ylab='Annual Tornados in IL',xlab='1950 - 2015',main='Time Series Plot of Annual Tornados in IL',type='o')
points(ts(kk,start=c(61),frequency=1),col="red",type='o')