Annual IL Tornado Count
Katie Ruben
April 22, 2016
As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a
cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In
addition, for this violently rotating column of air to be considered a tornado, the column must
make contact with the ground. When forecasting tornados, meteorologists look for four
ingredients in predicting such severe weather. These ingredients are present when the “temperature and
wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for
a tornadic thunderstorm to occur [1].”
Tornados are an important aspect of living in certain areas of the country. They can cause death,
injury, property damage, and high anxiety in many people who choose to live in areas prone
to tornados. This project looks at the number of annual tornados that have occurred in Illinois
since 1950. Meteorologists are interested in improving their understanding of the causes of
tornados as well as when they will occur. The data used in this analysis comes from the National
Oceanic and Atmospheric Administration [2]. The data set contains a tornado count from 1950 to
2015 for every state in the United States. I chose to look strictly at Illinois in this date range,
which yields 2,406 tornados over those 66 years. In particular, I am interested in forecasting the
number of tornados that will occur in subsequent years based on this time series data.
To analyze the Illinois tornado count time series, I will first check whether the data are
stationary or non-stationary by applying the Dickey Fuller test. The outcome will help determine
which set of time series models to continue with. Depending on the original data set, I may need
to perform transformations that reduce large variance and remove any explosive behavior in the
data. I will then generate preliminary models using the ACF, PACF, EACF, and an ARMA subset
selection. Once several potential models have been chosen, I will fit them by estimating
parameters with the maximum likelihood method. In addition, I will perform a residual analysis
on the fitted models to verify, to the best of my ability, that the residuals are normally
distributed, independent, and have constant variance. To do so, I will use the KS test, SW test,
and QQ plot for normality, the runs test and sample autocorrelation function for independence,
and the BP test for constant variance. I will continue building an appropriate model by checking
for outliers and adjusting the models based on the residual analysis. The final step is to forecast
the series into the future and compare the forecasts with the actual data to assess how accurate
the model has become.
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)
1   Background
Tornados are an important aspect of living in certain areas of the country. They can cause death,
injury, property damage, and high anxiety in many people who choose to live in areas prone
to tornados. In particular, this project looks at the number of annual tornados that have occurred
in Illinois since 1950. Meteorologists are interested in improving their understanding of the
causes of tornados as well as when they will occur. The data used in this analysis comes from the
National Oceanic and Atmospheric Administration [2]. The data being investigated contain a
tornado count from 1950 to 2015 for every state in the United States. I chose to look strictly at
Illinois in this date range, which yields 2,406 tornados over those 66 years. In particular, I am
interested in forecasting the number of tornados that will occur in subsequent years based on the
time series data.
As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a
cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In
addition, for this violently rotating column of air to be considered a tornado, the column must
make contact with the ground. When forecasting tornados, meteorologists look for four
ingredients in predicting such severe weather. These ingredients are present when the “temperature and
wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for
a tornadic thunderstorm to occur [1].”
Prior to starting the time series analysis, we split the 66 years of observations into two
sets: training data and validation data. The training data set contains 60 years (1950-2009),
while the validation data set contains 6 years (2010-2015). The validation set thus makes up
about 9% of the total observed years. Keep in mind that over these 66 years there were 2,406
tornado sightings in Illinois.
In this paper, we begin by performing preliminary transformations on the training set to ensure
stationarity. If the data show non-stationary behavior, several different transformations are
worked through in section 2 of this paper. Section 2 also contains the model identification
process for several time series models, as well as estimation and residual analysis. Since the
training data contains 60 observations, the ideal maximum lag recommended by the
autocorrelation of residuals is k = ln(60) ≈ 4. This will become important as we work through
this data set. Section 3 focuses on model validation, choosing which of our models is most
accurate, and forecasting. Section 4 contains a discussion of our results for the IL tornado
counts from 1950 to 2015.
2   Training Data Transformations
2.1.1 Training Data
To begin the model building process, we start by examining the training data set. The time series
plot of the Illinois annual tornado count is shown in Figure 1. The plot suggests extremely large
variance as well as explosive behavior as time passes, which indicates that the time series is
non-stationary. However, we need to conduct some formal testing.
Figure 1: Training Data Time Series & Scatter Plot
The Dickey Fuller test is used to determine whether the data set is stationary or non-stationary.
The null hypothesis states that α = 1, i.e., there is a unit root and the time series is
non-stationary. The alternative hypothesis states that α < 1, i.e., the time series is stationary.
If the time series is non-stationary, it is suggested to take the first difference. Throughout this
paper, we work with a significance level of .05.
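A brief sketch of how this test can be run in R with the fUnitRoots package, shown here on the training series t.data (the appendix code applies the same calls to the differenced log series):

library(fUnitRoots)
# Augmented Dickey-Fuller test; the null hypothesis is a unit root (non-stationarity)
adfTest(t.data, lags = 1, type = "nc")   # no constant
adfTest(t.data, lags = 1, type = "c")    # constant
adfTest(t.data, lags = 1, type = "ct")   # constant and linear trend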
2.1.2   Transformed Training Data
2.1.2.1 Log(Training Data)
Before we check for stationarity, we need to try to eliminate the large variance seen in the data.
To do so, we take the natural logarithm of the training time series. The plot is shown in Figure 2.
As seen in Figure 2, there is still large variance as well as explosive behavior. It is suggested to
examine a Box-Cox transformation to see whether a better representation of the data set exists.
However, when using the Box-Cox procedure in R, we obtain λ = 0, which corresponds to the
logarithmic transformation; this suggests that no further transformation beyond the log is needed.
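A minimal sketch of this check, assuming the TSA package's BoxCox.ar function is applied to the (strictly positive) training counts:

library(TSA)
# Profile likelihood over the Box-Cox parameter lambda;
# an estimate/interval near 0 supports the log transformation
bc <- BoxCox.ar(t.data)
bc$mle   # maximum likelihood estimate of lambda
bc$ci    # approximate 95% confidence interval for lambda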
Figure 2: Natural Logarithm Transformation Training Data Plot
2.1.2.2 Difference on Logarithm of Preliminary Data
The final transformation used to attempt to remove the explosive behavior is to difference the
training data. As seen in Figure 3, the time series plot with this transformation looks much better.
The explosive behavior has dissipated. Looking at the scatter plots of this transformed data in
Figure 4, we see that Y_t vs. Y_{t-1} shows a negative correlation, Y_t vs. Y_{t-2} shows either a
slight negative correlation or no correlation, and Y_t vs. Y_{t-3} shows no correlation.
Investigation of these plots suggests that we may have a time series model of order 1. We will
conduct formal model selection next.
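The differenced log series and its lag scatter plots can be produced along the lines of this sketch, which mirrors the appendix code (zlag is TSA's lag helper):

w <- diff(log(t.data))                  # first difference of the log counts
plot(w, type = 'o', main = 'Differenced log series')
# Lag-1, lag-2, and lag-3 scatter plots (Figure 4)
plot(y = w, x = zlag(w),        main = 'Lag 1')
plot(y = w, x = zlag(w, d = 2), main = 'Lag 2')
plot(y = w, x = zlag(w, d = 3), main = 'Lag 3')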
Figure 3: Difference Training Data Time Series Plots
Figure 4: Difference Training Data Scatter Plots
There no longer seems to be apparent explosive behavior in the time series plot after taking the
difference of the log transform, which suggests stationarity in the transformed training data.
However, a formal Dickey Fuller test must be applied. In doing so, we get a p-value = .01 < .05 = α
at lags 1 and 2 for the non-constant, constant, and linear trend cases. Since the p-value is less
than the significance level, we reject the null hypothesis and conclude that the differenced series
is stationary. A sample R output is shown in Appendix A.
2.1.3 Preliminary Model Building
Now that we have figured out how to eliminate the explosive behavior in our data set, we can
begin to look for preliminary models from which to build an appropriate time series model. To do
this, we will look at the ACF plot for potential moving average models, the PACF plot for
potential autoregressive models, the EACF chart for potential mixed-process models, and ARMA
subset selection. Figure 5 shows the ACF, PACF, and EACF output for the differenced series chosen
above.
Figure 5: ACF, PACF, & EACF
The ACF plot suggests that a moving average model of order 1 may be a potential model for our
data set. The PACF plot suggests that an autoregressive model of order 1 may be a potential model.
One aspect to keep in mind is that, theoretically, an autoregressive process should produce an
ACF whose lags decay exponentially; the sample ACF for this data set does not follow such a
decaying pattern, so an autoregressive model may not be the best suited model. The EACF chart
again suggests an AR(1). We can also identify potential models by looking at the ARMA subset
selection based on BIC or AIC values; this output is displayed in Figure 6. The maximum number of
lags allowed is k = ln(60) ≈ 4, based on the autocorrelation-of-residuals recommendation from the
literature on the topic. This output suggests that the best model for the data would be an MA(1)
with an intercept term, which matches the suggestion made by the ACF plot. The second best
suggestion would be an ARMA(1,1) process. Throughout the rest of this project, I will work with
the following processes: ARI(1,1), IMA(1,1), and ARIMA(1,1,1).
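A sketch of the identification step, mirroring the appendix code; nar and nma are capped at 4 per the lag recommendation above:

w <- diff(log(t.data))
acf(w)      # suggests MA(1)
pacf(w)     # suggests AR(1)
eacf(w)     # candidate ARMA(p, q) orders
sub1 <- armasubsets(w, nar = 4, nma = 4, y.name = 'test', ar.method = 'ols')
plot(sub1)  # BIC-based subset selection (Figure 6)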
Figure 6: ARMA Subset BIC
2.1.3.1 Estimations
Using maximum likelihood estimation, we were able to come up with suitable models for
ARI(1,1), IMA(1,1), and ARIMA(1,1,1). When working with time series models, it is best to
choose a simple model to explain the data. Figure 7 displays the estimates for each of these
models. Note that the intercept terms were not significant at the .05 level, so they do not need to
be included in the models. We determined significance by examining the R output for the
estimates: the ratio of the intercept coefficient to its standard error was close to zero compared
with the critical value of 1.96, indicating that the intercept was not significantly different from
zero. Note that the models were estimated using the log data.
ARI(1,1): Y_t = -0.4447 Y_{t-1} + e_t
IMA(1,1): Y_t = e_t - (-0.5491) e_{t-1}
ARIMA(1,1,1): Y_t = 0.3370 Y_{t-1} + e_t - (-0.8658) e_{t-1}
(Here Y_t denotes the differenced log series, since each model includes one order of differencing.)
Figure 7: Model Estimates
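A minimal sketch of these fits and of the significance check described above (the calls are simplified from the appendix code; the ratio of each estimate to its standard error is compared with 1.96):

AR1    <- arima(log(t.data), order = c(1, 1, 0), method = 'ML')   # ARI(1,1)
MA1    <- arima(log(t.data), order = c(0, 1, 1), method = 'ML')   # IMA(1,1)
ARMA11 <- arima(log(t.data), order = c(1, 1, 1), method = 'ML')   # ARIMA(1,1,1)
# |estimate / s.e.| < 1.96 suggests a coefficient is not significantly different from zero
AR1$coef / sqrt(diag(AR1$var.coef))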
2.1.3.1.1 Outliers
Before proceeding further, we must determine whether outliers exist for each of our potential
models. In R, we ran the additive outlier and innovational outlier commands (detectAO and
detectIO). For each model, both commands confirmed that there are no outliers in any of the three
models. Therefore, we can continue with the residual analysis.
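These checks follow the appendix code; a sketch for one model looks like:

detectAO(ARMA11)   # additive outliers
detectIO(ARMA11)   # innovational outliers
# Repeated for AR1 and MA1 as well; no outliers were flagged for any model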
2.1.3.2 Residual Analysis
The next step is to look at the residuals of our three models. From the residuals, we can assess
normality, constant error variance, and independence. Note that the original training data showed
large variance; the transformations fixed the non-stationarity, but we suspect the residuals may
still show large variance, and therefore possibly non-normality and dependence as well. However,
we will conduct formal tests on all three models for each of these characteristics.
As seen in Figure 8, the QQ plots do not suggest strong normality. All three models show somewhat
heavy tails, and the QQ line does not align with the data points as well as we would wish. In our
opinion, the ARI(1,1) model has the best-looking QQ plot for normality. To verify this conclusion,
we conduct a KS test and a Shapiro-Wilk test, which can be found in Appendix A. With a
significance level of .05, we fail to reject the null hypothesis in every normality test (KS and
Shapiro-Wilk) for each of the three models; in each case the p-value is greater than the
significance level. This means we can assume that the residuals are from the normal distribution.
Figure 8: QQ Plots of Models
Next we look at constant error variance, shown in Figure 9. In all three plots there appears to be
large variance around the horizontal line y = 0; however, the plot for each model does resemble
white noise, so we can assume possibly constant variance for each model. I was unable to perform
a BP or BF test on this data because I did not have the necessary x-variable to regress the
residuals on, and hence R would not produce these tests.
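A sketch of these diagnostics for one model, assuming the fitted objects AR1, MA1, and ARMA11 from the previous section (the full set of calls is in the appendix code):

r <- residuals(MA1)
qqnorm(r); qqline(r, col = 'red')    # QQ plot (Figure 8)
ks.test(r, "pnorm")                  # Kolmogorov-Smirnov test of normality
shapiro.test(r)                      # Shapiro-Wilk test of normality
plot(rstandard(MA1), type = 'o')     # standardized residuals (Figure 9)
abline(h = 0, col = 'red')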
Figure 9: Error Variance Analysis of Models
Finally, we check whether the residuals of our three models are independent. To do this, we use a
runs test for each model. Based on the runs tests shown in Appendix A, we conclude that the
residuals can be assumed independent: the p-value is greater than the significance level of .05
for each model, so we fail to reject the null hypothesis that the data are independent. Another
way to test for independence is the Ljung-Box test, whose null hypothesis is that the data are
independently distributed; in other words, it tests whether the residual autocorrelations are
jointly different from zero. Based on the results in Appendix A, we fail to reject the null
hypothesis: the p-value is greater than the significance level for each of the three models.
Finally, to confirm independence once more, we can look at the ACF plot of the residuals for each
model, shown in Figure 10. Since the lags all fall within the blue cutoff lines, we assume that the
residuals resemble white noise and are therefore independent.
Figure 10: ACF Residuals
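The independence checks can be reproduced with a sketch like the following (runs is TSA's runs test; fitdf in Box.test accounts for the number of estimated ARMA coefficients):

runs(residuals(MA1))                                                 # runs test
Box.test(residuals(MA1), lag = 4, type = "Ljung-Box", fitdf = 1)     # Ljung-Box test
acf(residuals(MA1))                                                  # residual ACF (Figure 10)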
Therefore, we have been able to transform the training data into a series whose residuals show
normality, constant variance, and independence. This is normally not an easy task, but with a data
set of only 66 annual observations it was achievable.
3 Model Validation
3.1.1 Confirmation of Models (Overfitting & Parameter Redundancy)
Now we will confirm that the three suggested models are good models for our data set by extending
the parameters of each (overfitting). If the estimate of the additional parameter is not
significantly different from zero, and the estimates of the original model do not change
significantly from their original values, then we can confirm that the model is a good fit. We
again use a significance level of .05: if the ratio of an estimated coefficient to its standard
error is less than the critical value of 1.96 in absolute value, we conclude that the coefficient
is not significantly different from zero.
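A sketch of this check for one pair of models, ARI(1,1) against the overfit ARI(2,1); the added coefficient's estimate-to-standard-error ratio is compared with 1.96, and the AIC values are compared as well:

AR1 <- arima(log(t.data), order = c(1, 1, 0), method = 'ML')
AR2 <- arima(log(t.data), order = c(2, 1, 0), method = 'ML')     # overfit by one AR term
ratio <- AR2$coef["ar2"] / sqrt(AR2$var.coef["ar2", "ar2"])
abs(ratio) < 1.96        # TRUE per the table below: the extra term is not significant
c(AIC(AR1), AIC(AR2))    # the overfit model also has the larger AIC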
Model        ARI(1,1)            ARI(2,1)
φ1 (s.e.)    -0.4447 (0.1155)    -0.4970 (0.1298)
φ2 (s.e.)           -            -0.1124 (0.1298)
AIC          125.67              126.93

Significance of the added term: φ2 ratio = -0.1124 / 0.1298 = -0.8659, |ratio| < 1.96, therefore not significant.

Model        IMA(1,1)            IMA(1,2)
θ1 (s.e.)    -0.5491 (0.1358)    -0.5527 (0.1352)
θ2 (s.e.)           -            -0.0475 (0.1728)
AIC          124.04              125.97

Significance of the added term: θ2 ratio = -0.0475 / 0.1728 = -0.2748, |ratio| < 1.96, therefore not significant.

Model        ARIMA(1,1,1)        ARIMA(1,1,2)        ARIMA(2,1,1)
φ1 (s.e.)    0.3370 (0.1698)     -0.9009 (0.2298)    0.3189 (0.1461)
θ1 (s.e.)    -0.8658 (0.1027)    0.3474 (0.2712)     -0.9114 (0.0714)
θ2 (s.e.)           -            -0.4464 (0.2057)           -
φ2 (s.e.)           -                   -            0.2174 (0.1388)
AIC          124.97              127.71              124.52

Significance of the added terms: θ2 ratio = -0.4464 / 0.2057 = -2.1701, |ratio| > 1.96, therefore significant; φ2 ratio = 0.2174 / 0.1388 = 1.566, |ratio| < 1.96, therefore not significant.
If we continue to increase the order of the ARIMA model, the AIC value continues to get larger.
Generally, a smaller AIC value indicates a better model, so ARIMA(1,1,1) appears to be the best
mixed-process model based on AIC.
In addition, when applying the overfitting procedure to our three suggested models, ARI(1,1) and
IMA(1,1) are confirmed to be good models for this time series. We were unable to confirm
ARIMA(1,1,1) as a good model because ARIMA(1,1,2) has a significant coefficient for θ2, and the
ARIMA estimates under overfitting were not close to the original estimates for this model.
Therefore, we cannot confirm ARIMA(1,1,1) as a good fit for our data.
As we continue with forecasting, we will therefore consider only the ARI(1,1) and IMA(1,1) models
for the log data.
3.1.2 Forecasting
The final step is to identify which of the remaining models is the better predictor of annual
tornados. We forecast values for the validation (testing) data set: recall that we initially set
aside the last 6 years of the data. Now we can test how accurate each model is. As seen in
Figure 11, the predictions made in R are displayed as red dots. These predictions were made using
the one-step-ahead forecasting procedure discussed in class. As you can tell, the forecasts are
not perfect. This is partly because the data set contains only 66 points, and these points may
need to be modeled using a different technique; however, we are using the techniques demonstrated
in class.
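A sketch of the rolling one-step-ahead forecasts on the log scale, mirroring the appendix code (y1 holds all 66 annual counts); shown here for IMA(1,1), with order = c(1, 1, 0) giving the ARI(1,1) version:

pred <- rep(NA, 6)
for (i in 1:6) {
  fit     <- arima(log(y1[1:(61 + i - 1)]), order = c(0, 1, 1), method = 'ML')  # expanding window
  pred[i] <- predict(fit, n.ahead = 1)$pred                                     # one step ahead
}
plot(ts(log(y1)), type = 'o', main = 'Forecasting with IMA(1,1)')
points(ts(pred, start = 61), col = 'red')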
Figure 11: Log data Forecast
In order to determine which model has better prediction capabilities, we will look at MSE, MAP,
and PMAD. The smaller the values, the better the prediction abilities. Therefore, by looking at
the table below we can determine that IMA(1,1) would have the best predicting capabilities
based on this recommendation.
Metric   ARI(1,1)      IMA(1,1)
MSE      0.0434644     0.03533288
MAP      0.0450561     0.03748878
PMAD     0.04383421    0.03675122

Smaller values indicate better predictive ability.
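These error measures are computed on the log scale; a minimal sketch, where pred holds the six one-step-ahead forecasts and y_test the six held-out counts (as in the appendix code):

MSE  <- mean((log(y_test) - pred)^2)                           # mean squared error
MAP  <- mean(abs((log(y_test) - pred) / log(y_test)))          # mean absolute percentage error
PMAD <- sum(abs(log(y_test) - pred)) / sum(abs(log(y_test)))   # percent mean absolute deviation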
As you can see, the values are not as small as we would like; however, with the data at hand, this
seems reasonably good. All code used to produce the predictions is shown in Appendix A.
Now that we have chosen IMA(1,1) as our best model, we transform the data back to its original
scale. Figure 12 shows the predicted values and prediction intervals on the log scale, while
Figure 13 shows the predictions and intervals after transforming back to the original scale.
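A sketch of the back-transformation, following the appendix code: because the model is fit to log counts, the point forecasts are converted with the lognormal mean correction exp(forecast + se^2/2), and the interval endpoints are exponentiated. Here pred and se stand for the one-step-ahead forecasts and their standard errors (pred1 and pred4 in the appendix):

point <- exp(pred + 0.5 * se^2)            # forecasts on the original count scale
lower <- exp(pred - qnorm(0.975) * se)     # 95% prediction interval, lower bound
upper <- exp(pred + qnorm(0.975) * se)     # 95% prediction interval, upper bound
data.frame(Year = 2010:2015, point, lower, upper)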
Figure 12: Log data Prediction Values & Interval
Figure 13: Original data Prediction Values & Interval
Figure 13 shows that three of our 6 values were predicted fairly close to the actual values.
However, there is still a lot of variability in the model's predictions. The prediction intervals
for the original data set, after transforming back, are very wide, which means the predictive
capability of IMA(1,1) is not especially good. The complete list of 95% prediction intervals for
the original data set is given in Appendix A. Figure 14 displays the final graph of the time
series data, comparing the original data to the 6 predicted values.
Figure 14: Time Series Plot with Predictions (Original Data)
IMA(1,1): Y_t = e_t - (-0.5491) e_{t-1}
4 Discussion
The goal of this analysis was to use knowledge of time series models to predict the future annual
count of tornados in Illinois. We started with 2,406 tornado sightings in Illinois from 1950 to
2015, which we aggregated into yearly counts, giving 66 data points. This was a
fairly small data set to perform a Time Series analysis on. Keep this in mind as we continue to
discuss our results.
We began our analysis with a training data set containing the first 60 years of observations,
setting aside the last 6 years as a testing data set, which became important when we performed our
forecasts. Our first goal was to ensure that the data were stationary. To do this, we applied the
log transformation and then differenced the data: the log transformation reduced the variability
seen in the original data set, while differencing removed the explosive behavior seen in the
original time series plot. We confirmed stationarity of the transformed series with the Dickey
Fuller test.
Once we had a stationary data set, we were able to begin the estimation process. We looked at
the ACF, PACF, EACF, and best subset selection chart in order to determine which models
would be best. We came to the conclusion that an ARI(1,1), IMA(1,1) and ARIMA(1,1,1) would
all be suitable models at this point. Next, we performed a residual analysis of all three models.
As discussed in the paper, the residuals of all three models were shown to be normal, to have
constant error variance, and to be independent. This is primarily due to the small sample size of
our data set; when a data set is small or extremely large, these three characteristics are a lot
easier to achieve. However, when we performed overfitting on all three of these models,
ARIMA(1,1,1) proved insufficient for this data set. Therefore, as we continued forward with the
project, we focused only on ARI(1,1) and IMA(1,1).
Finally, we forecasted values for our testing data set using both ARI(1,1) and IMA(1,1). In doing
so, we calculated the MSE, MAP, and PMAD for each of the models. We found that the
IMA(1,1) model had the smallest numerical value in all three of these tests. This meant that for
our data set, IMA(1,1) was the best model. However, note that the values of these criteria are not
as small as we would have wished; the smaller the value, the better the predictions. As seen in
the final time series plot in Figure 14, our predictions are far from perfect. In all cases, the
predictions overestimate the actual values, and in some cases, for example 2012, the
overestimation is drastic.
In order to further improve our models, we may need to try other time series models than those
that were discussed in class. In addition, to better predict tornados in Illinois we may have
wanted to break down our data set into quarters of the year. Clearly tornados are more frequent
in the spring and summer months. Using a different division of time would have given us a larger
number of data points from which to build different time series models. Looking at the data set we
did use for this project, the large variance over time in the tornado count could be due to the
number of people who are actually out in Illinois counting
them. In the early years of this data set, tornado counts may be skewed down as people may not
have been tracking them as much as we do in 2015. In addition, the number of tornados
increasing over time could be due to global warming or environmental effects.
In the end, this analysis shows that the model chosen to represent this data was relevant, but
could have been better. As stated before, the predictions were continually overestimated. In the
future, we would like to go back and test other potential models that were not discussed in this
course in order to better predict the annual tornado count in Illinois.
Appendix
A Reference for Model Building
A.1 Training Data Transformation Codes
* All code used for this project is appended at the very end of this paper.
A.2 Model Selection
Dickey Fuller Test on diff(log(t.data))
Model Estimations
A.2.1 Residual Analysis Codes
Normality Test:
H0: data is normal
Ha: data is not normal
Since the p-value in all three cases is > .05, we fail to reject H0.
Independence Test:
H0: data is independent
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.

Ljung-Box Test:
H0: data is independent (r_1 = r_2 = ... = r_k = 0)
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
A.3 Model Validation
Overfitting Models
A.4 Forecasting
Code:
### Illinois Tornado Annual Count Updated (Left out 6 data points)
library(TSA)
library(fUnitRoots)
data1<-read.csv(file="IL Total Data.csv",header=FALSE,sep=",")
x1<-data1[,2]
y1<-data1[,1]
y_train<-y1[1:60]
y_test<-y1[61:66]
t.data<-ts(y_train,freq=1,start=c(1950,1))
t.data1<-ts(y_test,freq=1,start=c(2010,1))
k.data<-ts(y1,freq=1,start=c(1950,1))
#Original Time Series Plot
plot(t.data,ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
plot(y=t.data,x=zlag(t.data),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # of IL Tornados vs Last Years # of IL Tornados')
#log Transform
plot(log(t.data),ylab='log(Annual Tornados in IL)',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
acf(log(t.data))
pacf(log(t.data))
eacf(log(t.data))
# eacf(log(t.data)) output:
# AR/MA
#   0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 0 x x o o o o o o o o o o o o
# 1 x o o o o o o o o o o o o o
# 2 o o o o o o o o o o o o o o
# 3 x o o o o o o o o o o o o o
# 4 o o o o o o o o o o o o o o
# 5 x o o o o o o o o o o o o o
# 6 x o o o o o o o o o o o o o
# 7 o o o o o o o o o o o o o o
#First Difference Log
plot(diff(log(t.data)),ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL Diff(log(t.data))',type='o')
#Test for Stationarity
adfTest(diff(log(t.data)),lags=1,type=c("nc"))
adfTest(diff(log(t.data)),lags=1,type = c("c"))
adfTest(diff(log(t.data)),lags = 1, type = c("ct"))
#Model Building
acf(diff(log(t.data))) #Suggests MA(1)
pacf(diff(log(t.data))) #Suggests AR(1)
eacf(diff(log(t.data))) # suggests AR(1)
# eacf(diff(log(t.data))) output:
# AR/MA
#   0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 0 x o o o o o o o o o o o o o
# 1 o o o o o o o o o o o o o o
# 2 x o o o o o o o o o o o o o
# 3 x o o o o o o o o o o o o o
# 4 x o o o o o o o o o o o o o
# 5 x x o o o o o o o o o o o o
# 6 x o o o o o o o o o o o o o
# 7 o o x x x o o o o o o o o o
#Best Subset suggests MA(1) as best, then ARMA(1,1)
sub1<-armasubsets(diff(log(t.data)),nar=4,nma=4,y.name='test',
ar.method='ols')
plot(sub1)
#Scatter PLot Comparison
par(mfrow = c(1, 3),pty = "s")
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data))),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=2),ylab='Tornado Count this Year',xlab='2 Years ago Tornado Count',main='Scatterplot of # of IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=3),ylab='Tornado Count this Year',xlab='3 Years ago Tornado Count',main='Scatterplot of # of IL Tornados')
abline(0,0)
#Fitting Models
AR1<-arima(log(t.data), order = c(1, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
MA1<-arima(log(t.data), order = c(0, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
ARMA11<-arima(log(t.data), order = c(1, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL,
method = 'ML')
#No outliers were detected.
detectAO(ARMA11)
detectIO(ARMA11)
#Residual Analysis
tsdiag(AR1,gof=4,omit.initial=F)
tsdiag(MA1,gof=4,omit.initial=F)
tsdiag(ARMA11,gof=4,omit.initial=F)
#Normality
op <- par(mfrow = c(1, 3),pty = "s")
qqnorm(residuals(AR1),main='ARI(1,1) QQ Plot')
qqline(residuals(AR1),col='red')
qqnorm(residuals(MA1),main='IMA(1,1) QQ Plot')
qqline(residuals(MA1),col='red')
qqnorm(residuals(ARMA11),main='ARIMA(1,1,1) QQ Plot')
qqline(residuals(ARMA11),col='red')
#Formal Testing
ks.test(residuals(AR1),"pnorm")
shapiro.test(residuals(AR1))
ks.test(residuals(MA1),"pnorm")
shapiro.test(residuals(MA1))
ks.test(residuals(ARMA11),"pnorm")
shapiro.test(residuals(ARMA11))
#Constant Variance
op <- par(mfrow = c(1, 3),pty = "s")
plot(rstandard(AR1),ylab='Standardized residuals',main='ARI(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(MA1),ylab='Standardized residuals',main='IMA(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(ARMA11),ylab='Standardized residuals',main='ARIMA(1,1,1)',type='o')
abline(0,0,col="red",lwd=2)
#Independence
#ACF Plot
op <- par(mfrow = c(1, 3),pty = "s")
acf(residuals(AR1),main='Sample ACF of Residuals from ARI(1,1) Model')
acf(residuals(MA1),main='Sample ACF of Residuals from IMA(1,1) Model')
acf(residuals(ARMA11),main='Sample ACF of Residuals from ARIMA(1,1,1) Model')
#Ljung-Box
Box.test(residuals(AR1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(MA1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(ARMA11),lag=4, type="Ljung-Box",fitdf=2)
# Runs
runs(residuals(AR1))
runs(residuals(MA1))
runs(residuals(ARMA11))
#Over fitting Parameter Redundancy
AR2<-arima(log(t.data), order = c(2, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
MA2<-arima(log(t.data), order = c(0, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL, method =
'ML')
ARMA12<-arima(log(t.data), order = c(1, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL,
method = 'ML')
ARMA21<-arima(log(t.data), order = c(2, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL,
method = 'ML')
#Predictions/Forecasting
#ARI(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {y_train<-y1[1:(61+i-1)]
est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
pred1[i]<-predict(est_1,n.ahead=1)$pred}
t.pred1<-ts(pred1,freq=1,start=c(2010,1))
t.pred1
# Time Series:
# Start = 2010
# End = 2015
# Frequency = 1
# [1] 3.926734 4.112738 3.839367 3.759854 3.944391 4.077988
log(y_test)
# [1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with ARI(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))
#IMA(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {y_train<-y1[1:(61+i-1)]
est_1<-arima(log(y_train),order=c(0,1,1),method='ML')
pred1[i]<-predict(est_1,n.ahead=1)$pred}
pred4<-rep(NA,6)
for(i in 1:6) {y_train<-y1[1:(61+i-1)]
est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
pred4[i]<-predict(est_1,n.ahead=1)$se}
t.pred2<-ts(pred1,freq=1,start=c(2010,1))
t.pred2
# Time Series:
# Start = 2010
# End = 2015
# Frequency = 1
# [1] 3.884078 4.067417 3.799722 3.891222 3.891483 4.041335
log(y_test)
# [1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with IMA(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))
#Based off of our predictions for ARI(1,1), IMA(1,1); IMA(1,1) was the best model.
#Prediction Intervals for IMA(1,1)
lower<-pred1-qnorm(0.975,0,1)*pred4
upper<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Year=c(2010:2015),lower,upper)
#Transform Model Back
y_test
# [1] 49 73 32 55 49 69
kk<-exp(pred1 + (1/2)*(pred4)^2)
kk
# [1] 61.39905 73.55173 56.25562 61.43353 61.24097 70.94283
#100(1-alpha)% prediction intervals
# Create lower and upper prediction interval bounds
lower1<-pred1-qnorm(0.975,0,1)*pred4
upper1<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Years=c(1:66),lower1,upper1)
#Original 95% Prediction Intervals
data.frame(Years=c(1:66),exp(lower1),exp(upper1))
Years exp.lower1. exp.upper1.
1 1 12.74597 185.4787
2 2 15.43213 221.0484
3 3 11.82101 168.9437
4 4 13.08390 183.2885
5 5 13.21798 181.5241
6 6 15.48158 209.1434
7 7 12.74597 185.4787
8 8 15.43213 221.0484
9 9 11.82101 168.9437
10 10 13.08390 183.2885
11 11 13.21798 181.5241
12 12 15.48158 209.1434
13 13 12.74597 185.4787
14 14 15.43213 221.0484
15 15 11.82101 168.9437
16 16 13.08390 183.2885
17 17 13.21798 181.5241
18 18 15.48158 209.1434
19 19 12.74597 185.4787
20 20 15.43213 221.0484
21 21 11.82101 168.9437
22 22 13.08390 183.2885
23 23 13.21798 181.5241
24 24 15.48158 209.1434
25 25 12.74597 185.4787
26 26 15.43213 221.0484
27 27 11.82101 168.9437
28 28 13.08390 183.2885
29 29 13.21798 181.5241
30 30 15.48158 209.1434
31 31 12.74597 185.4787
32 32 15.43213 221.0484
33 33 11.82101 168.9437
34 34 13.08390 183.2885
35 35 13.21798 181.5241
36 36 15.48158 209.1434
37 37 12.74597 185.4787
38 38 15.43213 221.0484
39 39 11.82101 168.9437
40 40 13.08390 183.2885
41 41 13.21798 181.5241
42 42 15.48158 209.1434
43 43 12.74597 185.4787
44 44 15.43213 221.0484
45 45 11.82101 168.9437
46 46 13.08390 183.2885
47 47 13.21798 181.5241
48 48 15.48158 209.1434
49 49 12.74597 185.4787
50 50 15.43213 221.0484
51 51 11.82101 168.9437
52 52 13.08390 183.2885
53 53 13.21798 181.5241
54 54 15.48158 209.1434
55 55 12.74597 185.4787
56 56 15.43213 221.0484
57 57 11.82101 168.9437
58 58 13.08390 183.2885
59 59 13.21798 181.5241
60 60 15.48158 209.1434
61 61 12.74597 185.4787
62 62 15.43213 221.0484
63 63 11.82101 168.9437
64 64 13.08390 183.2885
65 65 13.21798 181.5241
66 66 15.48158 209.1434
#Convert back to Original TS PLOT IMA(1,1)
plot(y1,ylab='Annual Tornados in IL',xlab='1950 - 2015',main='Time Series Plot of Annual Tornados in IL',type='o')
points(ts(kk,start=c(61),frequency=1),col="red",type='o')
References
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)

More Related Content

Similar to NEW Time Series Paper

Forecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptxForecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptx
MOINDALVS
 
Running head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docxRunning head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docx
SUBHI7
 
vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014Thibault Vatter
 
Storm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SASStorm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SAS
Gautam Sawant
 
Final Time series analysis part 2. pptx
Final Time series analysis part 2.  pptxFinal Time series analysis part 2.  pptx
Final Time series analysis part 2. pptx
SHUBHAMMBA3
 
FA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdfFA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdf
SOUMYASHARMA909224
 
Module 3 - Time Series.pptx
Module 3 - Time Series.pptxModule 3 - Time Series.pptx
Module 3 - Time Series.pptx
nikshaikh786
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
Arun Kejariwal
 
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Bayu imadul Bilad
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Alexander Decker
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Alexander Decker
 
Time series
Time seriesTime series
Dinosaur Extinction Essay
Dinosaur Extinction EssayDinosaur Extinction Essay
Dinosaur Extinction Essay
Kristi Anderson
 
Applied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).pptApplied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).ppt
swamyvivekp
 
Analysis of Taylor Rule Deviations
Analysis of Taylor Rule DeviationsAnalysis of Taylor Rule Deviations
Analysis of Taylor Rule DeviationsCheng-Che Hsu
 
Report stella simulator
Report stella simulatorReport stella simulator
Report stella simulator
syamimiauni18
 
Option pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatilityOption pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatility
FGV Brazil
 
How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...
Petter Holme
 

Similar to NEW Time Series Paper (20)

Forecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptxForecasting_CO2_Emissions.pptx
Forecasting_CO2_Emissions.pptx
 
Running head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docxRunning head accident and casual factor events1accident and ca.docx
Running head accident and casual factor events1accident and ca.docx
 
vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014vatter_wu_chavez_yu_2014
vatter_wu_chavez_yu_2014
 
Storm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SASStorm Prediction data analysis using R/SAS
Storm Prediction data analysis using R/SAS
 
Final Time series analysis part 2. pptx
Final Time series analysis part 2.  pptxFinal Time series analysis part 2.  pptx
Final Time series analysis part 2. pptx
 
FA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdfFA_Unit 4 (Time Series Analysis E-views).pdf
FA_Unit 4 (Time Series Analysis E-views).pdf
 
Module 3 - Time Series.pptx
Module 3 - Time Series.pptxModule 3 - Time Series.pptx
Module 3 - Time Series.pptx
 
Event_studies
Event_studiesEvent_studies
Event_studies
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre...
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...
 
Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...Application of panel data to the effect of five (5) world development indicat...
Application of panel data to the effect of five (5) world development indicat...
 
Time series
Time seriesTime series
Time series
 
Dinosaur Extinction Essay
Dinosaur Extinction EssayDinosaur Extinction Essay
Dinosaur Extinction Essay
 
Applied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).pptApplied Statistics Chapter 2 Time series (1).ppt
Applied Statistics Chapter 2 Time series (1).ppt
 
Analysis of Taylor Rule Deviations
Analysis of Taylor Rule DeviationsAnalysis of Taylor Rule Deviations
Analysis of Taylor Rule Deviations
 
Report stella simulator
Report stella simulatorReport stella simulator
Report stella simulator
 
Forecasting techniques
Forecasting techniquesForecasting techniques
Forecasting techniques
 
Option pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatilityOption pricing under multiscale stochastic volatility
Option pricing under multiscale stochastic volatility
 
How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...How the information content of your contact pattern representation affects pr...
How the information content of your contact pattern representation affects pr...
 

NEW Time Series Paper

  • 1. 1 Annual IL Tornado Count Katie Ruben April 22, 2016 As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologist’s look for four ingredients in predicting such severe weather. These ingredients are when the “temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur[1].” Tornado’s are an important aspect of living in certain areas of the country. They can cause death, injury, property damage, and also high anxiety in many people who choose to live in areas prone to Tornados. In particular, my project will deal with looking at the number of annual tornados that have occurred in Illinois since 1950. Meteorologists are interested in improving their understanding of the causes of tornados as well as when they are to occur. The data used during this simulation comes from the National Oceanic and Atmospheric Administration [2]. The data I am looking at contains a tornado count from 1950 to 2015 for every state in the United States. I choose to look strictly at Illinois in this date range. When doing so, I ended up with 2406 tornados over those 65 years. In particular, I am interested in forecasting the number of tornados that will occur in subsequent years based on the time series data I have found. In order to analyze the Illinois Tornado Count times series data I will first look to see if the data is stationary or non-stationary. I will apply the Dickey Fuller test to determine this. Depending on this outcome, it will help me determine which set of time series models I will want to continue with. Depending on my original data set, I will want to perform transformations that reduce large variance as well as take care of any explosive behavior in the data. Upon doing so, I will then be able to generate preliminary models using ACF, PACF, EACF, and the ARMA subset. Once several potential models have been chosen, I will fit these models by estimating parameters using the maximum likelihood method. In addition, perform a residual analysis on my fitted models and make sure, to the best of my ability, that the models are from the normal distribution, are independent, and have constant variance. In order to achieve this, I will look at the KS Test, SW Test, and QQ-plot for normality, runs test and sample autocorrelation function for independence, and finally the BP Test for constant variance. I will continue building an appropriate model by looking for outliers and adjusting my models based on residual analysis. The final step is to perform a forecasting of my data set into the future. I will compare my forecast with my actual data set to see how accurate my model has become. [1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from http://www.spc.noaa.gov/faq/tornado/ [2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)
  • 2. 2 1   Background Tornado’s are an important aspect of living in certain areas of the country. They can cause death, injury, property damage, and also high anxiety in many people who choose to live in areas prone to Tornados. In particular, this project will deal with looking at the number of annual tornados that have occurred in Illinois since 1950. Meteorologists are interested in improving their understanding of the causes of tornados as well as when they are to occur. The data used during this simulation comes from the National Oceanic and Atmospheric Administration [2]. The data being investigated contains a tornado count from 1950 to 2015 for every state in the United States. I choose to look strictly at Illinois in this date range. When doing so, I ended up with 2406 tornados over those 66 years. In particular, I am interested in forecasting the number of tornados that will occur in subsequent years based on the time series data found. As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologist’s look for four ingredients in predicting such severe weather. These ingredients are when the “temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur [1].” Prior to starting this time series data analysis, we will split our 66 year observations into two sets; training data and validation data. The training data set will contain 60 years (1950-2009), while the validation data set will contain 6 years (2010-2015). The validation set contains 9% of the total observed years. Keep in mind that over these 66 years there were 2,406 tornado sightings in Illinois. In this paper, we begin by using our time series data set to perform preliminary transformations on the training set to ensure stationarity. If the data set shows non-stationary behavior, I will go through several different transformations in section 2 of this paper. Section 2 will also contain the model identification process for several time series models as well as estimation and residual analysis. Since the training data contains 56 observations, the ideal maximum lag recommended by the autocorrelation of residuals is 𝑘 = ln(56) ≈ 4. This will become important as we work through this data set. In section 3, we will focus on model validation and choosing which of our models is most accurate as well forecasting. In section 4 we will focus on a discussion of our results from this project on IL tornado counts between 1950-2015.
  • 3. 3 2   Training Data Transformations 2.1.1 Training Data To begin our model building process, we start by examining the training data set. The time series plot of our Illinois Annual Tornado Count is shown in figure 1. This plot suggests that there is extremely large variance as well as an explosive behavior being demonstrated as time passes. This suggests that our time series data set is non-stationary. However, we need to conduct some formal testing. Figure 1: Training Data Time Series & Scatter Plot The Dickey Fuller Test is used in order to determine if the data set is stationary or non- stationary. The null hypotheses states that 𝛼 = 1 then there is a unit root and the time series in non-stationary. The alternative hypotheses states that 𝛼 < 1 then the time series is stationary. If the time series is non-stationary, then it is suggested to take the difference. Throughout this paper, we will be concerned with a significance level of .05. Time Series Plot of Annual Tornados in IL Time AnnualTornadosinIL 1950 1960 1970 1980 1990 2000 2010 020406080100120 0 20 40 60 80 100 120 020406080100120 Scatterplot of # of IL Torndaos vs Last Years # of IL Tornados Previous Year Tornado Count TornadoCountthisYear
  • 4. 4 2.1.2   Transformed Training Data 2.1.2.1 Log(Training Data) Before, we check for stationarity, we need to try to eliminate the large variance seen in our data. To do so, we take the natural logarithmic transformation of the training time series data. The plot is shown in Figure 2. As seen if Figure 2, there still exists large variance, as well as an explosive behavior. It is suggested now to look at a Box Cox Transformation of the logarithm transformation to see if we can come up with a better representation for our data set. However, when using the Box Cox transformation in R, we get that 𝜆 = 0. This suggests that no transformation is needed. Figure 2: Natural Logarithm Transformation Training Data Plot 2.1.2.3  Difference on Logarithm of Preliminary Data The final transformation used to attempt to remove the explosive behavior is to difference the training data. As seen in Figure 3, the time series plot with this transformation looks much better. The explosive behavior has dissipated. Looking at the scatter plot of this transformed data in Figure 4, we see that 𝑌1   𝑣 𝑠. 𝑌167  shows a negative correlation, 𝑌1   𝑣 𝑠. 𝑌168 shows either a slight negative correlation or no correlation and 𝑌1   𝑣 𝑠. 𝑌169 shows no correlation. Investigation of these plots suggests that we may have a time series model of order 1. We will conduct formal model selections next. Time Series Plot of Annual Tornados in IL Time log(AnnualTornadosinIL) 1950 1960 1970 1980 1990 2000 2010 1.52.02.53.03.54.04.5
  • 5. 5 Figure 3: Difference Training Data Time Series Plots Figure 4: Difference Training Data Scatter Plots There no longer seems to be an apparent explosive behavior in the times series plot when taking the difference log transform. This suggests stationarity in our transformed training data. However, a formal Dickey Fuller Test must be applied. In doing so, we get a 𝑝 − 𝑣𝑎𝑙𝑢𝑒 = .01 < .05 = 𝛼 for lag = 1,2 for non-constant, constant, and linear trends. Therefore, since the p-value is less than the significance level we reject the null hypothesis and our model is suggested to be stationary. A sample R output is shown in Appendix A. Time Series Plot of Annual Tornados in IL Diff(log(t.data)) Time AnnualTornadosinIL 1950 1960 1970 1980 1990 2000 2010 -1.5-1.0-0.50.00.51.01.5 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5-1.0-0.50.00.51.01.5 Scatterplot of # IL Tornados Previous Year Tornado Count TornadoCountthisYear -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5-1.0-0.50.00.51.01.5 Scatterplot of # of IL Torndaos 2 Years ago Tornado Count TornadoCountthisYear -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 -1.5-1.0-0.50.00.51.01.5 Scatterplot of # of IL Torndaos 3 Years ago Tornado Count TornadoCountthisYear
  • 6. 6 2.1.3 Preliminary Model Building Now that we have figured out how to eliminate the large explosive behavior in our data set, we can begin to look at finding preliminary models to build an appropriate times series model from. In order to do this, we will look at the ACF plot for potential Moving Average models, PACF plot for potential Autoregressive models, EACF chart for potential Mixed Process models, and ARMA subset selections. Figure 5 shows the ACF, PACF and EACF plots for our difference model chosen above. Figure 5: ACF, PACF, & EACF The ACF plot suggests that with our data set, a Moving Average of order 1 may be a potential model. The PACF plot suggests that a Autoregressive of order 1 may be a potential model. One aspect to keep in mind is that the PACF should show lags that exponentially decay theoretically. The PACF plot for this data set does not follow this exponential decaying pattern. Therefore, an Autoregressive model may not be the best suited model. The EACF plot suggests again an AR(1) We can also determine the best potential model by looking at the ARMA subset based on BIC or AIC values. This output is displayed in Figure 6. The maximum number of lags allowed is 𝑘 = ln(61) ≈ 4 based on the autocorrelation of residuals recommendation from literature on the topic. This output suggests that the best model for my data would be MA(1) with an intercept term. This is the same suggestion made by the ACF plot. The second best suggestion would be an ARMA(1,1) process. Throughout the rest of this project, I will work with the following processes; ARI(1,1), IMA(1,1), and ARIMA(1,1,1). 5 10 15 -0.4-0.3-0.2-0.10.00.10.2 Series diff(log(t.data)) Lag ACF 5 10 15 -0.4-0.3-0.2-0.10.00.10.2 Lag PartialACF Series diff(log(t.data))
  • 7. 7 Figure 6: ARMA Subset BIC 2.1.3.1 Estimations Using maximum likelihood estimates, we were able to come up with suitable models for ARI(1,1), IMA(1,1), and ARIMA(1,1,1). When working with time series models, it is best to always choose a simple model to explain your data. Figure 7 displays the following estimates for each of these model. Note that the intercept terms were not significant enough at a level of .05, so they don’t need to be included in the models. We determined the significance by looking at the output in R for the estimations and found that when looking at the ratio of the intercept coefficient to the standard error of that coefficient, the value was close to zero when compared to the critical value of 1.96. This indicated that the intercept of the coefficient was not significantly different from zero since the ratio of estimation per standard error was less than 1.96. Note that my models were estimated using the log data. ARI(1,1): 𝑌1 = −0.4447 𝑌167 + 𝑒1 IMA(1,1):   𝑌1 = 𝑒1 − (−.5491) 𝑒167 ARIMA(1,1,1): 𝑌1 = .3370 𝑌167 + 𝑒1 − (−.8658)(𝑒167) Figure 7: Model Estimates 2.1.3.1.1 Outliers Before proceeding further, we must determine if there exist outliers for each of our potential models. In R, we ran the additive outlier and innovational outlier commands. Both commands in R, (AO and IO detect), for each model confirmed that there did not exist an outlier in any of the three models. Therefore, we can continue with our residual analysis. BIC (Intercept) test-lag1 test-lag2 test-lag3 test-lag4 error-lag1 error-lag2 error-lag3 error-lag4 17 14 10 6.9 3.4 0.45 -2.1 -4.7
8
2.1.3.2 Residual Analysis
The next step is to examine the residuals of the three models. From the residuals, we can assess normality, constant error variance, and independence. Recall that the original training data showed large variance; the transformations took care of stationarity, but going in we expect that the residuals may still show substantial variance, and possibly departures from normality and independence as well. We will therefore conduct formal tests of each of these characteristics for all three models.
As seen in Figure 8, the QQ plots do not suggest strong normality. All three models show somewhat heavy tails, and the QQ reference line does not track the data points as closely as we would like. In our opinion, the ARI(1,1) model has the best-looking QQ plot for normality. To verify this, we conduct a Kolmogorov-Smirnov (KS) test and a Shapiro-Wilk test, reported in Appendix A. At a significance level of .05, we fail to reject the null hypothesis of normality in every KS and Shapiro-Wilk test for all three models; in each case the p-value exceeds the significance level. This means we may treat the residuals as coming from a normal distribution.
Figure 8: QQ Plots of Models
Next we look at constant error variance, shown in Figure 9. In all three plots there is considerable spread around the horizontal line y = 0, but each model's standardized residuals resemble white noise, so we can tentatively assume constant variance for each model. I was unable to perform a Breusch-Pagan or Brown-Forsythe test on these residuals because there is no x-variable to regress them on, so R could not produce those tests.
[Figure 8: normal QQ plots of the residuals from the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) fits.]
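A minimal sketch of the normality and variance diagnostics, assuming the TSA package is loaded and the fitted objects AR1, MA1, and ARMA11 from the estimation step:

# QQ plots of the residuals for each fitted model (Figure 8)
op <- par(mfrow = c(1, 3), pty = "s")
qqnorm(residuals(AR1), main = 'ARI(1,1) QQ Plot');      qqline(residuals(AR1), col = 'red')
qqnorm(residuals(MA1), main = 'IMA(1,1) QQ Plot');      qqline(residuals(MA1), col = 'red')
qqnorm(residuals(ARMA11), main = 'ARIMA(1,1,1) QQ Plot'); qqline(residuals(ARMA11), col = 'red')
par(op)

# Formal normality tests (repeated for each model in Appendix A)
ks.test(residuals(MA1), "pnorm")   # Kolmogorov-Smirnov against a standard normal
shapiro.test(residuals(MA1))       # Shapiro-Wilk

# Standardized residuals over time as a visual check for constant variance (Figure 9)
plot(rstandard(MA1), ylab = 'Standardized residuals', type = 'o')
abline(h = 0, col = 'red')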
9
Figure 9: Error Variance Analysis of Models
[Figure 9: standardized residuals over time for the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) fits.]
Finally, we check whether the residuals of the three models are independent. To do this, we first use a runs test for each model. Based on the runs tests in Appendix A, we fail to reject the null hypothesis that the residuals are independent, because each p-value is greater than the significance level of .05; we therefore treat the residuals as independent. Another way to test for independence is the Ljung-Box test, whose null hypothesis is that the data are independently distributed; in other words, it tests whether the residual autocorrelations are jointly different from zero. Based on the results in Appendix A, we again fail to reject the null hypothesis, since the p-value exceeds the significance level for each of the three models. As a final confirmation, we examine the ACF plot of the residuals for each model in Figure 10. Since the sample autocorrelations all fall within the blue cutoff lines, the residuals resemble white noise and can be treated as independent.
Figure 10: ACF Residuals
[Figure 10: sample ACF of the residuals from the ARI(1,1), IMA(1,1), and ARIMA(1,1,1) models.]
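A minimal sketch of the independence checks, again assuming the TSA package and the fitted objects from the estimation step:

# Runs test on the residuals (TSA::runs); repeated for AR1 and ARMA11 in Appendix A
runs(residuals(MA1))

# Ljung-Box test; fitdf is the number of estimated ARMA parameters in the model
Box.test(residuals(MA1), lag = 4, type = "Ljung-Box", fitdf = 1)

# Sample ACF of the residuals (Figure 10); bars inside the bounds suggest white noise
acf(residuals(MA1), main = 'Sample ACF of Residuals from IMA(1,1) Model')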
10
Therefore, we have been able to transform the training data into a series whose model residuals show normality, constant variance, and independence. This is often not an easy task, but with a data set of only 66 annual values it was manageable.
3 Model Validation
3.1.1 Confirmation of Models (Overfitting & Parameter Redundancy)
We now confirm that the three suggested models are good models for the data set by extending each with one additional parameter. If the estimate of the additional parameter is not significantly different from zero, and the estimates of the original parameters do not change much from their original values, then we can confirm that the model is a good fit. We use a significance level of .05: if the absolute value of the ratio of an estimated coefficient to its standard error is less than the critical value of 1.96, we treat that coefficient as not significantly different from zero. A sketch of the overfitting fits follows the comparison tables below.

Model          ARI(1,1)            ARI(2,1)
φ1 (s.e.)      -0.4447 (0.1155)    -0.4970 (0.1298)
φ2 (s.e.)      --                  -0.1124 (0.1298)
AIC            125.67              126.93
Check on φ2:   -0.1124 / 0.1298 = -0.8659; |-0.8659| < 1.96, therefore not significant.

Model          IMA(1,1)            IMA(1,2)
θ1 (s.e.)      -0.5491 (0.1358)    -0.5527 (0.1352)
θ2 (s.e.)      --                  -0.0475 (0.1728)
AIC            124.04              125.97
Check on θ2:   -0.0475 / 0.1728 = -0.2748; |-0.2748| < 1.96, therefore not significant.

Model          ARIMA(1,1,1)        ARIMA(1,1,2)        ARIMA(2,1,1)
φ1 (s.e.)      0.3370 (0.1698)     -0.9009 (0.2298)    0.3189 (0.1461)
φ2 (s.e.)      --                  --                  0.2174 (0.1388)
θ1 (s.e.)      -0.8658 (0.1027)    0.3474 (0.2712)     -0.9114 (0.0714)
θ2 (s.e.)      --                  -0.4464 (0.2057)    --
AIC            124.97              127.71              124.52
Check on θ2 in ARIMA(1,1,2):  -0.4464 / 0.2057 = -2.1701; |-2.1701| > 1.96, therefore significant.
Check on φ2 in ARIMA(2,1,1):  0.2174 / 0.1388 = 1.566; 1.566 < 1.96, therefore not significant.

Continuing to increase the order of the ARIMA model does not improve the AIC in a useful way; in general, a smaller AIC indicates a better model. Taking the AIC values and these significance checks together, ARIMA(1,1,1) appears to be the best mixed-process candidate.
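A minimal sketch of the overfitting fits, matching the calls in Appendix A; the final two lines are an illustrative way to read off the added coefficient's z-ratio and to compare AIC values.

# Fit each candidate with one extra AR or MA parameter
AR2    <- arima(log(t.data), order = c(2, 1, 0), method = 'ML')   # ARI(2,1)
MA2    <- arima(log(t.data), order = c(0, 1, 2), method = 'ML')   # IMA(1,2)
ARMA12 <- arima(log(t.data), order = c(1, 1, 2), method = 'ML')   # ARIMA(1,1,2)
ARMA21 <- arima(log(t.data), order = c(2, 1, 1), method = 'ML')   # ARIMA(2,1,1)

# z-ratios for the extended model's coefficients, and AIC comparison with the original fit
MA2$coef / sqrt(diag(MA2$var.coef))
c(AIC(MA1), AIC(MA2))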
11
Looking at the three suggested models under the overfitting procedure, ARI(1,1) and IMA(1,1) are confirmed to be good models for this time series. We were unable to confirm ARIMA(1,1,1) as a good model, because the overfitted ARIMA(1,1,2) has a significant coefficient for θ2 and the overfitted estimates were not close to the original estimates for this model. Therefore, we cannot confirm ARIMA(1,1,1) as a good fit for our data, and as we continue with forecasting I will work only with the ARI(1,1) and IMA(1,1) models on the log data.
3.1.2 Forecasting
The final step is to identify which of the remaining models is the better predictor of annual tornados. We forecast values for the testing data set; recall that we initially held out the last 6 years of the data, so we can now test how accurate the models are. As seen in Figure 11, the predictions made in R are displayed as red dots. These predictions were produced with the one-step-ahead forecasting procedure discussed in class; a sketch of the forecast loop and accuracy measures follows below. As you can tell, the fit is not perfect. This is partly because the data set contains only 66 points, and those points may ultimately need a different modeling technique; here, however, we restrict ourselves to the techniques demonstrated in class.
Figure 11: Log Data Forecasts
[Figure 11: one-step-ahead forecasts (red points) of log(t.data) under the ARI(1,1) and IMA(1,1) models.]
To determine which model has better predictive ability, we look at the MSE, MAP, and PMAD; the smaller these values, the better the predictions. From the table below, IMA(1,1) has the best predictive ability by this criterion.
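A minimal sketch of the one-step-ahead forecast loop and the accuracy measures; the window indexing follows the script in Appendix A, y1 is the full annual count series, y_test the 6 held-out years, and pred, fit, and err are illustrative names.

# One-step-ahead forecasts of the 6 test years under IMA(1,1); use order = c(1, 1, 0) for ARI(1,1)
pred <- rep(NA, 6)
for (i in 1:6) {
  y_train <- y1[1:(61 + i - 1)]                 # expand the fitting window as in Appendix A
  fit     <- arima(log(y_train), order = c(0, 1, 1), method = 'ML')
  pred[i] <- predict(fit, n.ahead = 1)$pred     # next-step point forecast on the log scale
}

# Accuracy measures against the held-out log counts
err  <- log(y_test) - pred
MSE  <- mean(err^2)
MAP  <- mean(abs(err / log(y_test)))
PMAD <- sum(abs(err)) / sum(abs(log(y_test)))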
12
           ARI(1,1)      IMA(1,1)
MSE        0.0434644     0.03533288
MAP        0.0450561     0.03748878
PMAD       0.04383421    0.03675122
IMA(1,1) has the smaller value on every measure. As you can see, the values are not as small as we would like, but with the data at hand this seems reasonably good. All code used to produce the predictions is in Appendix A.
Now that we have chosen IMA(1,1) as our best model, we transform the forecasts back to the original scale. Figure 12 shows the predictions made on the log data, while Figure 13 shows the predictions after transforming back to the original count scale; both figures also include the corresponding prediction intervals. A sketch of the back-transformation follows below.
Figure 12: Log Data Prediction Values & Intervals
Figure 13: Original Data Prediction Values & Intervals
Figure 13 shows that three of our 6 values were predicted fairly close to the actual value. However, there is still a lot of variability in the model's predictions, and the prediction intervals for the original data after transforming back are very wide. This means the predictive capability of IMA(1,1) is not especially strong. The complete list of 95% prediction intervals for the original data set is given in Appendix A. Figure 14 displays the final graph of the time series, comparing the original data to the 6 predicted values.
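A minimal sketch of the back-transformation from the log scale to tornado counts, where pred and se are assumed to hold the one-step point forecasts and their standard errors from predict(), as in Appendix A:

# Point forecasts on the count scale using the lognormal mean correction
count_hat <- exp(pred + 0.5 * se^2)

# 95% prediction intervals: form them on the log scale, then exponentiate
lower <- exp(pred - qnorm(0.975) * se)
upper <- exp(pred + qnorm(0.975) * se)
data.frame(Year = 2010:2015, forecast = round(count_hat, 1), lower = round(lower, 1), upper = round(upper, 1))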
13
Figure 14: Time Series Plot with Predictions (Original Data)
[Figure 14: annual tornado counts in IL, 1950-2015, with the 6 predicted values overlaid.]
IMA(1,1):  Y_t = e_t - (-0.5491) e_(t-1)
4 Discussion
The goal of this analysis was to use time series models to predict the future annual count of tornados in Illinois. We started with 2,406 tornado sightings in Illinois from 1950 to 2015 and aggregated them into yearly counts, giving 66 data points. This is a fairly small data set for a time series analysis, which should be kept in mind throughout the discussion of our results.
We began the analysis with a training data set containing the first 60 years, setting aside the last 6 years as a testing data set that became essential when we performed our forecasts. Our first goal was to make the data stationary. To do this, we applied a log transformation and then differenced the data: the log transformation reduced the variability seen in the original series, while differencing removed the explosive behavior seen in the original time series plot. We confirmed stationarity with the Dickey Fuller test.
Once we had a stationary data set, we began the estimation process. We looked at the ACF, PACF, EACF, and best subset selection chart to determine which models would be best.
14
We came to the conclusion that ARI(1,1), IMA(1,1), and ARIMA(1,1,1) would all be suitable models at that point.
Next, we performed a residual analysis of all three models. As discussed in the paper, all three models were shown to have normality, constant error variance, and independence. This is primarily due to the small sample size of our data set; when a data set is small (or extremely large), these characteristics are much easier to achieve. However, when we performed overfitting checks on all three models, ARIMA(1,1,1) was shown not to be adequate for this data set. Therefore, as we continued with the project, we focused only on ARI(1,1) and IMA(1,1).
Finally, we forecasted values for the testing data set using both ARI(1,1) and IMA(1,1) and calculated the MSE, MAP, and PMAD for each model. The IMA(1,1) model had the smallest value on all three criteria, making it the best model for our data set. Note, however, that the values of these criteria are not as small as we would have wished; the smaller the value, the better the predictions. As seen in the final time series plot in Figure 14, our predictions are far from perfect. In every test year, the predictions overestimate the actual values, and in some cases, for example 2012, this overestimation is drastic.
To further improve the models, we may need to try time series models beyond those discussed in class. In addition, to better predict tornados in Illinois, we might have broken the data down into quarters of the year, since tornados are clearly more frequent in the spring and summer months. Using a finer division of time would also have given us more data points from which to build different time series models. Looking at the data set we did use, the large variance over time in the tornado count could be due to the number of people actually out in Illinois counting them: in the early years of the record, counts may be skewed low because tornados were not tracked as intensively as they are in 2015. The apparent increase in tornado counts over time could also reflect global warming or other environmental effects.
In the end, this analysis shows that the chosen model is relevant but could be better. As noted, the predictions were consistently overestimated. In the future, we would like to test other potential models not covered in this course in order to better predict the annual tornado count in Illinois.
15
Appendix A   Reference for Model Building
A.1 Training Data Transformation Codes
* All code used for this project is appended at the very end of this paper.
A.2 Model Selection
Dickey Fuller test on diff(log(t.data))
16
Model Estimations
A.2.1 Residual Analysis Codes
Normality Tests:
H0: data is normal
Ha: data is not normal
Since the p-value in all three cases is > .05, we fail to reject H0.
17
Independence Test:
H0: data is independent
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
Ljung-Box Test:
H0: data is independent (r1 = r2 = ... = rk = 0)
Ha: data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
20
Code:
### Illinois Tornado Annual Count Updated (Left out 6 data points)
library(TSA)
library(fUnitRoots)
data1<-read.csv(file="IL Total Data.csv",header=FALSE,sep=",")
x1<-data1[,2]
y1<-data1[,1]
y_train<-y1[1:60]
y_test<-y1[61:66]
t.data<-ts(y_train,freq=1,start=c(1950,1))
t.data1<-ts(y_test,freq=1,start=c(2010,1))
k.data<-ts(y1,freq=1,start=c(1950,1))

#Original Time Series Plot
plot(t.data,ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
plot(y=t.data,x=zlag(t.data),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # of IL Torndaos vs Last Years # of IL Tornados')

#log Transform
plot(log(t.data),ylab='log(Annual Tornados in IL)',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
acf(log(t.data))
pacf(log(t.data))
eacf(log(t.data))
AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x x o o o o o o o o o  o  o  o
1 x o o o o o o o o o o  o  o  o
2 o o o o o o o o o o o  o  o  o
3 x o o o o o o o o o o  o  o  o
4 o o o o o o o o o o o  o  o  o
5 x o o o o o o o o o o  o  o  o
6 x o o o o o o o o o o  o  o  o
7 o o o o o o o o o o o  o  o  o

#First Difference Log
plot(diff(log(t.data)),ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL Diff(log(t.data))',type='o')

#Test for Stationarity
adfTest(diff(log(t.data)),lags=1,type=c("nc"))
adfTest(diff(log(t.data)),lags=1,type=c("c"))
adfTest(diff(log(t.data)),lags=1,type=c("ct"))
21
#Model Building
acf(diff(log(t.data)))   #Suggests MA(1)
pacf(diff(log(t.data)))  #Suggests AR(1)
eacf(diff(log(t.data)))  #Suggests AR(1)
AR/MA
  0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 x o o o o o o o o o o  o  o  o
1 o o o o o o o o o o o  o  o  o
2 x o o o o o o o o o o  o  o  o
3 x o o o o o o o o o o  o  o  o
4 x o o o o o o o o o o  o  o  o
5 x x o o o o o o o o o  o  o  o
6 x o o o o o o o o o o  o  o  o
7 o o x x x o o o o o o  o  o  o

#Best Subset suggests MA(1) as best, then ARMA(1,1)
sub1<-armasubsets(diff(log(t.data)),nar=4,nma=4,y.name='test', ar.method='ols')
plot(sub1)

#Scatter Plot Comparison
par(mfrow = c(1, 3),pty = "s")
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data))),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=2),ylab='Tornado Count this Year',xlab='2 Years ago Tornado Count',main='Scatterplot of # of IL Torndaos')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=3),ylab='Tornado Count this Year',xlab='3 Years ago Tornado Count',main='Scatterplot of # of IL Torndaos')
abline(0,0)

#Fitting Models
AR1<-arima(log(t.data), order = c(1, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
MA1<-arima(log(t.data), order = c(0, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
ARMA11<-arima(log(t.data), order = c(1, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')

#No outliers were detected.
detectAO(ARMA11)
detectIO(ARMA11)

#Residual Analysis
tsdiag(AR1,gof=4,omit.initial=F)
tsdiag(MA1,gof=4,omit.initial=F)
tsdiag(ARMA11,gof=4,omit.initial=F)

#Normality
op <- par(mfrow = c(1, 3),pty = "s")
qqnorm(residuals(AR1),main='ARI(1,1) QQ Plot')
qqline(residuals(AR1),col='red')
qqnorm(residuals(MA1),main='IMA(1,1) QQ Plot')
qqline(residuals(MA1),col='red')
qqnorm(residuals(ARMA11),main='ARIMA(1,1,1) QQ Plot')
qqline(residuals(ARMA11),col='red')
22
#Formal Testing
ks.test(residuals(AR1),"pnorm")
shapiro.test(residuals(AR1))
ks.test(residuals(MA1),"pnorm")
shapiro.test(residuals(MA1))
ks.test(residuals(ARMA11),"pnorm")
shapiro.test(residuals(ARMA11))

#Constant Variance
op <- par(mfrow = c(1, 3),pty = "s")
plot(rstandard(AR1),ylab='Standardized residuals',main='ARI(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(MA1),ylab='Standardized residuals',main='IMA(1,1)',type='o')
abline(0,0,col="red",lwd=2)
plot(rstandard(ARMA11),ylab='Standardized residuals',main='ARIMA(1,1,1)',type='o')
abline(0,0,col="red",lwd=2)

#Independence
#ACF Plot
op <- par(mfrow = c(1, 3),pty = "s")
acf(residuals(AR1),main='Sample ACF of Residuals from ARI(1,1) Model')
acf(residuals(MA1),main='Sample ACF of Residuals from IMA(1,1) Model')
acf(residuals(ARMA11),main='Sample ACF of Residuals from ARIMA(1,1,1) Model')

#Ljung-Box
Box.test(residuals(AR1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(MA1),lag=4, type="Ljung-Box",fitdf=1)
Box.test(residuals(ARMA11),lag=4, type="Ljung-Box",fitdf=2)

#Runs
runs(residuals(AR1))
runs(residuals(MA1))
runs(residuals(ARMA11))

#Over fitting / Parameter Redundancy
AR2<-arima(log(t.data), order = c(2, 1, 0),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
MA2<-arima(log(t.data), order = c(0, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
ARMA12<-arima(log(t.data), order = c(1, 1, 2),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')
ARMA21<-arima(log(t.data), order = c(2, 1, 1),xreg = NULL, include.mean = TRUE,init = NULL, method = 'ML')

#Predictions/Forecasting
#ARI(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {
  y_train<-y1[1:(61+i-1)]
  est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
  pred1[i]<-predict(est_1,n.ahead=1)$pred
}
t.pred1<-ts(pred1,freq=1,start=c(2010,1))
t.pred1
Time Series:
Start = 2010
End = 2015
Frequency = 1
[1] 3.926734 4.112738 3.839367 3.759854 3.944391 4.077988
23
log(y_test)
[1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with ARI(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))

#IMA(1,1) Predictions
pred1<-rep(NA,6)
for(i in 1:6) {
  y_train<-y1[1:(61+i-1)]
  est_1<-arima(log(y_train),order=c(0,1,1),method='ML')
  pred1[i]<-predict(est_1,n.ahead=1)$pred
}
#One-step forecast standard errors (taken from an ARI(1,1) refit)
pred4<-rep(NA,6)
for(i in 1:6) {
  y_train<-y1[1:(61+i-1)]
  est_1<-arima(log(y_train),order=c(1,1,0),method='ML')
  pred4[i]<-predict(est_1,n.ahead=1)$se
}
t.pred2<-ts(pred1,freq=1,start=c(2010,1))
t.pred2
Time Series:
Start = 2010
End = 2015
Frequency = 1
[1] 3.884078 4.067417 3.799722 3.891222 3.891483 4.041335
log(y_test)
[1] 3.891820 4.290459 3.465736 4.007333 3.891820 4.234107
plot(ts(log(y1)),type="o",main="Forecasting with IMA(1,1)",ylab="log(t.data)")
points(ts(pred1,start=c(61),frequency=1),col="red")
MSE=mean((log(y_test)-pred1)^2)
MAP=mean(abs((log(y_test)-pred1)/(log(y_test))))
PMAD=sum(abs(log(y_test)-pred1))/sum(abs(log(y_test)))

#Based off of our predictions for ARI(1,1) and IMA(1,1); IMA(1,1) was the best model.
#Prediction Intervals for IMA(1,1)
lower<-pred1-qnorm(0.975,0,1)*pred4
upper<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Year=c(2010:2015),lower,upper)

#Transform Model Back
y_test
[1] 49 73 32 55 49 69
kk<-exp(pred1 + (1/2)*(pred4)^2)
[1] 61.39905 73.55173 56.25562 61.43353 61.24097 70.94283
24
#100(1-alpha)% prediction intervals
#Create lower and upper prediction interval bounds
lower1<-pred1-qnorm(0.975,0,1)*pred4
upper1<-pred1+qnorm(0.975,0,1)*pred4
data.frame(Years=c(1:66),lower1,upper1)

#Original 95% Prediction Intervals
data.frame(Years=c(1:66),exp(lower1),exp(upper1))
   Years exp.lower1. exp.upper1.
1      1    12.74597    185.4787
2      2    15.43213    221.0484
3      3    11.82101    168.9437
4      4    13.08390    183.2885
5      5    13.21798    181.5241
6      6    15.48158    209.1434
7      7    12.74597    185.4787
8      8    15.43213    221.0484
9      9    11.82101    168.9437
10    10    13.08390    183.2885
11    11    13.21798    181.5241
12    12    15.48158    209.1434
13    13    12.74597    185.4787
14    14    15.43213    221.0484
15    15    11.82101    168.9437
16    16    13.08390    183.2885
17    17    13.21798    181.5241
18    18    15.48158    209.1434
19    19    12.74597    185.4787
20    20    15.43213    221.0484
21    21    11.82101    168.9437
22    22    13.08390    183.2885
23    23    13.21798    181.5241
24    24    15.48158    209.1434
25    25    12.74597    185.4787
26    26    15.43213    221.0484
27    27    11.82101    168.9437
28    28    13.08390    183.2885
29    29    13.21798    181.5241
30    30    15.48158    209.1434
31    31    12.74597    185.4787
32    32    15.43213    221.0484
33    33    11.82101    168.9437
34    34    13.08390    183.2885
35    35    13.21798    181.5241
36    36    15.48158    209.1434
37    37    12.74597    185.4787
38    38    15.43213    221.0484
39    39    11.82101    168.9437
40    40    13.08390    183.2885
41    41    13.21798    181.5241
42    42    15.48158    209.1434
43    43    12.74597    185.4787
44    44    15.43213    221.0484
45    45    11.82101    168.9437
46    46    13.08390    183.2885
47    47    13.21798    181.5241
48    48    15.48158    209.1434
49    49    12.74597    185.4787
50    50    15.43213    221.0484
51    51    11.82101    168.9437
52    52    13.08390    183.2885
25
53    53    13.21798    181.5241
54    54    15.48158    209.1434
55    55    12.74597    185.4787
56    56    15.43213    221.0484
57    57    11.82101    168.9437
58    58    13.08390    183.2885
59    59    13.21798    181.5241
60    60    15.48158    209.1434
61    61    12.74597    185.4787
62    62    15.43213    221.0484
63    63    11.82101    168.9437
64    64    13.08390    183.2885
65    65    13.21798    181.5241
66    66    15.48158    209.1434

#Convert back to Original TS Plot, IMA(1,1)
plot(y1,ylab='Annual Tornados in IL',xlab='1950 - 2015',main='Time Series Plot of Annual Tornados in IL',type='o')
points(ts(kk,start=c(61),frequency=1),col="red",type='o')