This paper is a methodological exercise presenting the results obtained from estimating the growth convergence equation using different methodologies.
A dynamic balanced panel is estimated using OLS, Within-Group, Anderson-Hsiao, First Difference, GMM with endogenous instruments, and GMM with predetermined instruments. An unbalanced panel is also estimated for OLS, WG, and FD.
Results are discussed in light of Monte Carlo studies.
This project supports time series analysis and forecasting. It provides better accuracy and metrics in short-term forecasting for intermediate planning toward the target of reducing CO2 emissions. Different models, such as exponential smoothing techniques, linear statistical models, and autoregressive models, are implemented to forecast the emissions, and the result is finally deployed on Streamlit.
Accident and Causal Factor Events — SUBHI7
Wesley D. Herron
Waldorf University
Accident And Causal Factor Events
The causal factor and events chart provides a graphical and written description of the time sequence of events contributing to an accident. Constructing the causal factor and events chart helps investigators conduct in-depth research and identify the root causes of accidents. The following are the elements involved in event charting:
· Condition - The distinct state that facilitates the occurrence of an event. A condition may be the weather, equipment status, the health status of an employee, or any other factor affecting an event.
· Event - A point in time described by the occurrence of a specific action.
· Accident - The action, state, or condition in which a system fails to meet one or more of its design objectives (Oakley, 2012). This includes real accidents as well as near misses, and it is the main focus of the analysis or evaluation.
· Primary event line - The main sequence of occurrences that resulted in the accident. The primary event line gives the basic nature of an occurrence in a logical progression. However, it does not provide the contributing causes, and while it usually contains the accident, it does not necessarily end with the accident event. The primary event line comprises both conditions and events.
· Primary events and conditions - The conditions and events that make up the primary event line.
· Secondary event lines - Series of occurrences that lead to primary conditions and events. Secondary event lines extend the primary event line to show all the facilitating elements of an event. Causal factors are often found on secondary event lines, which contain both events and conditions.
· Secondary events and conditions - The conditions and events that make up the secondary event line.
Causal factors
These are the key conditions and events that, had they been eliminated, would have prevented an accident or reduced its effects (Oakley, 2012). Causal factors include equipment failure and human error, as well as the event that initiated the accident, safeguards that failed, and reasonable safeguards that were not provided.
Items of note
These are undesirable events and conditions identified during an analysis that must be addressed or corrected. However, they did not contribute to the accident under investigation, so they are shown as separate boxes outside the event chain.
Accidents and events are examined to identify their causes and to decide what actions must be taken to prevent a recurrence. It is essential that accident investigators look deeply into both the events and the conditions ...
Storm Prediction data analysis using R/SAS — Gautam Sawant
• Performed data cleaning and analysis using R and SAS to predict the financial loss caused by storms and to predict when a storm will occur based on previous storm data
• Implemented algorithms such as Logistic Regression, Multiple Regression, Linear Discriminant Analysis, and PCA to obtain insights from the storm dataset covering 1950-2007
Sequence-to-Sequence Modeling for Time Series — Arun Kejariwal
In this talk we give an overview of sequence-to-sequence (S2S) modeling and explore its early use cases. We walk the audience through how to leverage S2S modeling for several use cases, particularly real-time anomaly detection and forecasting.
Intervention Analysis for Evaluating The Impact of Policy to Tackle The Incre... — Bayu imadul Bilad
Coronavirus disease 2019 (COVID-19), caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), is a disease currently spreading almost all over the world. The virus originated in Wuhan, China, at the end of December 2019. The impact of COVID-19 is felt not only in deaths but also in countries' economies, and if it is not addressed immediately it will cause even more casualties and other losses. In this study, we analyze the impact of interventions, that is, the policies undertaken by several countries to tackle COVID-19, and we predict the increase in positive COVID-19 cases over the next few days. The countries studied are Indonesia, South Korea, and Singapore. The COVID-19 case prediction model uses multi-intervention analysis with Box-Jenkins ARIMA. The results show that, in general, the policies taken by the governments do not significantly influence the decrease in COVID-19 cases. In South Korea, the policy of enforcing massive rapid testing significantly and permanently reduced the growth in positive cases. On the other hand, the policy of reopening several public sectors significantly and temporarily increased the addition of positive cases in South Korea fivefold. In Singapore, the lockdown and social distancing policy reduced the growth in positive cases until the second intervention occurred; the second intervention had a direct effect whose decrease continued until the last observation, and it was not significant enough to increase positive cases. In Indonesia, the Large-Scale Social Restrictions (PSBB) policy had a significant effect in the third period after the first intervention, making the addition of positive COVID-19 cases in Indonesia more stable at around 300. Keywords: COVID-19, ARIMA model, Intervention Analysis
A time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time. It is thus a sequence of discrete-time data.
Option pricing under multiscale stochastic volatility — FGV Brazil
The stochastic volatility model proposed by Fouque, Papanicolaou, and Sircar (2000) exploits a fast and a slow time-scale fluctuation of the volatility process to arrive at a parsimonious way of capturing the volatility smile implied by near-the-money options. In this paper, we test three different models of these authors using options on the S&P 500. First, we use model-independent statistical tools to demonstrate the presence of a short time-scale, on the order of days, and a long time-scale, on the order of months, in the S&P 500 volatility. Our analysis of market data shows that both time-scales are statistically significant. We also provide a calibration method using observed option prices as represented by the so-called term structure of implied volatility. The resulting approximation is still independent of the particular details of the volatility model and gives more flexibility in the parametrization of the implied volatility surface. In addition, to test the model's ability to price options, we simulate option prices using four different specifications for the data generating process. As an illustration, we price an exotic option.
Ongoing Master Thesis by Cristina Tessari and Caio Almeida;
EPGE Brazilian School of Economics and Finance.
http://www.fgv.br/epge/en
NEW Time Series Paper
Annual IL Tornado Count
Katie Ruben
April 22, 2016
As stated by the NOAA [1], "a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud." In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologists look for four ingredients in predicting such severe weather. These ingredients are present when the "temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur" [1].
Tornado’s are an important aspect of living in certain areas of the country. They can cause death,
injury, property damage, and also high anxiety in many people who choose to live in areas prone
to Tornados. In particular, my project will deal with looking at the number of annual tornados
that have occurred in Illinois since 1950. Meteorologists are interested in improving their
understanding of the causes of tornados as well as when they are to occur. The data used during
this simulation comes from the National Oceanic and Atmospheric Administration [2]. The data I
am looking at contains a tornado count from 1950 to 2015 for every state in the United States. I
choose to look strictly at Illinois in this date range. When doing so, I ended up with 2406
tornados over those 65 years. In particular, I am interested in forecasting the number of tornados
that will occur in subsequent years based on the time series data I have found.
In order to analyze the Illinois tornado count time series data, I will first check whether the data is stationary or non-stationary by applying the Dickey-Fuller test. The outcome will help me determine which set of time series models to continue with. Depending on the original data set, I will perform transformations that reduce the large variance and take care of any explosive behavior in the data. I will then generate preliminary models using the ACF, PACF, EACF, and ARMA subset selection. Once several potential models have been chosen, I will fit them by estimating parameters with the maximum likelihood method. In addition, I will perform a residual analysis on my fitted models to verify, to the best of my ability, that the residuals are normally distributed, independent, and have constant variance. To do this, I will use the KS test, SW test, and QQ plot for normality, the runs test and sample autocorrelation function for independence, and the BP test for constant variance. I will continue building an appropriate model by looking for outliers and adjusting my models based on the residual analysis. The final step is to forecast my data set into the future and compare the forecast with the actual data to see how accurate my model has become.
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)
1 Background
Tornado’s are an important aspect of living in certain areas of the country. They can cause death,
injury, property damage, and also high anxiety in many people who choose to live in areas prone
to Tornados. In particular, this project will deal with looking at the number of annual tornados
that have occurred in Illinois since 1950. Meteorologists are interested in improving their
understanding of the causes of tornados as well as when they are to occur. The data used during
this simulation comes from the National Oceanic and Atmospheric Administration [2]. The data
being investigated contains a tornado count from 1950 to 2015 for every state in the United
States. I choose to look strictly at Illinois in this date range. When doing so, I ended up with
2406 tornados over those 66 years. In particular, I am interested in forecasting the number of
tornados that will occur in subsequent years based on the time series data found.
As stated by the NOAA [1], "a tornado is a violently rotating column of air, suspended from a cumuliform cloud or underneath a cumuliform cloud, and often visible as a funnel cloud." In addition, for this violently rotating column of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados, meteorologists look for four ingredients in predicting such severe weather. These ingredients are present when the "temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a tornadic thunderstorm to occur" [1].
Prior to starting this time series analysis, we split our 66 years of observations into two sets: training data and validation data. The training data set contains 60 years (1950-2009), while the validation data set contains 6 years (2010-2015); the validation set thus contains about 9% of the total observed years. Keep in mind that over these 66 years there were 2,406 tornado sightings in Illinois. A minimal sketch of the split appears below.
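A minimal sketch of this split, assuming the annual counts sit in the vector y1 as in the appendix code:

y_train <- y1[1:60]                        # training set: 1950-2009
y_test  <- y1[61:66]                       # validation set: 2010-2015
t.data  <- ts(y_train, freq = 1, start = c(1950, 1))
t.data1 <- ts(y_test,  freq = 1, start = c(2010, 1))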
In this paper, we begin by performing preliminary transformations on the training set to ensure stationarity. If the data set shows non-stationary behavior, I will go through several different transformations in Section 2 of this paper. Section 2 also contains the model identification process for several time series models, as well as estimation and residual analysis. Since the training data contains 60 observations, the ideal maximum lag recommended for the autocorrelation of residuals is k = ln(60) ≈ 4; this will become important as we work through this data set. In Section 3, we focus on model validation, choosing which of our models is most accurate, and forecasting. In Section 4, we discuss the results of this project on IL tornado counts between 1950 and 2015.
2 Training Data Transformations
2.1.1 Training Data
To begin our model building process, we start by examining the training data set. The time series plot of the Illinois annual tornado count is shown in Figure 1. The plot suggests extremely large variance as well as explosive behavior as time passes, which indicates that our time series is non-stationary. However, we need to conduct some formal testing.
Figure 1: Training Data Time Series & Scatter Plot
The Dickey-Fuller test is used to determine whether the data set is stationary or non-stationary. The null hypothesis states that α = 1, in which case there is a unit root and the time series is non-stationary. The alternative hypothesis states that α < 1, in which case the time series is stationary. If the time series is non-stationary, it is suggested to take the difference. Throughout this paper, we work with a significance level of .05. A sketch of the test call appears below.
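A minimal sketch of the test call, using the fUnitRoots package as in the appendix code ("nc", "c", and "ct" select the no-constant, constant, and constant-plus-trend versions of the test):

library(fUnitRoots)
adfTest(t.data, lags = 1, type = c("nc"))  # p-value > .05 => cannot reject the unit root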
[Figure 1 panels: time series plot of annual tornados in IL (1950-2009) and a scatterplot of this year's tornado count against the previous year's count.]
2.1.2 Transformed Training Data
2.1.2.1 Log(Training Data)
Before we check for stationarity, we try to eliminate the large variance seen in our data. To do so, we take the natural logarithm of the training time series; the plot is shown in Figure 2. As seen in Figure 2, large variance and explosive behavior still exist. It is suggested to apply a Box-Cox analysis to see if we can come up with a better transformation for our data set. When using the Box-Cox procedure in R, we get λ = 0, which corresponds to the log transformation; this confirms that the log transform is appropriate and that no further transformation is needed. A sketch appears below.
Figure 2: Natural Logarithm Transformation Training Data Plot
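One way to carry out the Box-Cox check in R is TSA's BoxCox.ar; this is an assumption, since the paper does not name the exact function used. A value of λ = 0 corresponds to the log transformation.

library(TSA)
bc <- BoxCox.ar(t.data)   # plots the log-likelihood over a grid of lambda values
bc$mle                    # maximum-likelihood estimate of lambda (here approximately 0)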
2.1.2.2 Difference of the Logarithm of the Training Data
The final transformation, used to remove the explosive behavior, is to difference the log-transformed training data. As seen in Figure 3, the time series plot of this transformed data looks much better: the explosive behavior has dissipated. Looking at the scatter plots of the transformed data in Figure 4, we see that $Y_t$ vs. $Y_{t-1}$ shows a negative correlation, $Y_t$ vs. $Y_{t-2}$ shows either a slight negative correlation or no correlation, and $Y_t$ vs. $Y_{t-3}$ shows no correlation. Investigation of these plots suggests that we may have a time series model of order 1. We will conduct formal model selection next.
Figure 3: Difference Training Data Time Series Plots
Figure 4: Difference Training Data Scatter Plots
There no longer appears to be explosive behavior in the time series plot after taking the difference of the log transform, which suggests stationarity in our transformed training data. However, a formal Dickey-Fuller test must be applied. In doing so, we get a p-value of .01 < .05 = α for lags 1 and 2, for the no-constant, constant, and linear-trend versions of the test. Therefore, since the p-value is less than the significance level, we reject the null hypothesis and conclude that the transformed series is stationary. A sample R output is shown in Appendix A.
2.1.3 Preliminary Model Building
Now that we have figured out how to eliminate the explosive behavior in our data set, we can begin looking for preliminary models from which to build an appropriate time series model. To do this, we look at the ACF plot for potential moving average models, the PACF plot for potential autoregressive models, the EACF chart for potential mixed-process models, and ARMA subset selection. Figure 5 shows the ACF, PACF, and EACF plots for the differenced series chosen above.
Figure 5: ACF, PACF, & EACF
The ACF plot suggests that a moving average model of order 1 may be a potential model for our data set. The PACF plot suggests that an autoregressive model of order 1 may be a potential model. One aspect to keep in mind is that the PACF should theoretically show exponentially decaying lags; the PACF plot for this data set does not follow that pattern, so an autoregressive model may not be the best suited. The EACF plot again suggests an AR(1). We can also determine the best potential model by looking at the ARMA subset selection based on BIC or AIC values; this output is displayed in Figure 6. The maximum number of lags allowed is k = ln(60) ≈ 4, based on the autocorrelation-of-residuals recommendation from the literature on the topic. This output suggests that the best model for my data would be an MA(1) with an intercept term, the same suggestion made by the ACF plot; the second best suggestion is an ARMA(1,1) process. Since the series was differenced once, throughout the rest of this project I will work with the following processes: ARI(1,1), IMA(1,1), and ARIMA(1,1,1). The identification calls are sketched below.
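The identification calls, as in the appendix code (armasubsets is from the TSA package and ranks ARMA specifications by BIC):

acf(diff(log(t.data)))    # suggests MA(1)
pacf(diff(log(t.data)))   # suggests AR(1)
eacf(diff(log(t.data)))   # suggests AR(1)
sub1 <- armasubsets(diff(log(t.data)), nar = 4, nma = 4, y.name = 'test', ar.method = 'ols')
plot(sub1)                # best row: MA(1) with intercept; next best: ARMA(1,1)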
Figure 6: ARMA Subset BIC
2.1.3.1 Estimations
Using maximum likelihood estimation, we obtained suitable models for ARI(1,1), IMA(1,1), and ARIMA(1,1,1). When working with time series, it is best to choose a simple model to explain your data. Figure 7 displays the estimates for each of these models. Note that the intercept terms were not significant at the .05 level, so they are not included in the models. We determined significance from the R estimation output by looking at the ratio of the intercept coefficient to its standard error: the ratio was well below the critical value of 1.96, indicating that the intercept was not significantly different from zero. Note that my models were estimated on the log data; a sketch of this check follows Figure 7.
ARI(1,1): $Y_t = -0.4447\,Y_{t-1} + e_t$
IMA(1,1): $Y_t = e_t + 0.5491\,e_{t-1}$
ARIMA(1,1,1): $Y_t = 0.3370\,Y_{t-1} + e_t + 0.8658\,e_{t-1}$
Figure 7: Model Estimates
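A minimal sketch of the significance check described above, here for the IMA(1,1) fit (stats::arima returns the estimates in $coef and their covariance matrix in $var.coef):

fit <- arima(log(t.data), order = c(0, 1, 1), method = 'ML')   # IMA(1,1)
z <- fit$coef / sqrt(diag(fit$var.coef))                       # estimate / s.e.
abs(z) > 1.96             # TRUE => significantly different from zero at the .05 level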
2.1.3.1.1 Outliers
Before proceeding further, we must determine if there exist outliers for each of our potential
models. In R, we ran the additive outlier and innovational outlier commands. Both commands in
R, (AO and IO detect), for each model confirmed that there did not exist an outlier in any of the
three models. Therefore, we can continue with our residual analysis.
[Figure 6: armasubsets output. Rows are candidate models ordered by BIC (from 17 down to -4.7); columns are the intercept, AR lags 1-4 (test-lag1 through test-lag4), and MA lags 1-4 (error-lag1 through error-lag4).]
2.1.3.2 Residual Analysis
The next step is to look at the residuals of our three models. From the residuals, we can assess normality, constant error variance, and independence. Note that the original training data had large variance; the transformations fixed the stationarity, but we expect that the residuals may still show large variance, and thus possibly non-normality and non-independence. However, we will conduct formal tests on all three models for each of these characteristics.
As seen in Figure 8, the QQ plots do not suggest strong normality. In all three models there appear to be heavy tails, and the QQ normal line does not align with our data points as well as we would wish. In our opinion, the ARI(1,1) model has the best-looking QQ plot for normality. To verify this conclusion, we conduct a KS test and a Shapiro-Wilk test, which can be found in Appendix A. With a significance level of .05, we fail to reject the null hypothesis in every KS and Shapiro-Wilk test for each of our three models; in each case, the p-value is greater than the significance level. This means we may assume that the residuals are from a normal distribution. A sketch of the test calls follows Figure 8.
Figure 8: QQ Plots of Models
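A sketch of the two normality tests on the residuals, here for the IMA(1,1) fit MA1 from the appendix; running the KS test on standardized residuals against the standard normal is an assumption about the exact procedure used (both tests use H0: the residuals are normal):

res <- residuals(MA1)
shapiro.test(res)                          # Shapiro-Wilk test
ks.test(as.numeric(scale(res)), "pnorm")   # Kolmogorov-Smirnov test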
Next we look at constant error variance, as seen in Figure 9. In all three plots there appears to be large variance across the horizontal line y = 0. However, the plot for each model does appear to resemble white noise, so we can assume possibly constant variance for each model. I was unable to perform a BP or BF test on this data because I did not have the necessary x-variable to regress my residuals on, and hence R would not produce these tests.
Figure 9: Error Variance Analysis of Models
Finally, we check whether the residuals of our three models are independent. To do this, we use a runs test for each model. Based on the runs tests shown in Appendix A, we conclude that the transformed data may be assumed independent: the p-value is greater than the significance level of .05, so we fail to reject the null hypothesis that the data is independent. Another way to test for independence is the Ljung-Box test, whose null hypothesis is that the data is independently distributed; that is, it tests whether the autocorrelations of the time series differ from zero. Based on the results in Appendix A, we fail to reject the null hypothesis, since the p-value is greater than the significance level for each of our three models. Finally, to confirm independence once more, we can look at the ACF plot of the residuals for each model, as seen in Figure 10. Since the lags are all within the blue cutoff lines, we may assume that the residuals resemble white noise and are therefore independent. A sketch of these checks follows Figure 10.
Figure 10: ACF Residuals
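A sketch of the independence checks, here for the IMA(1,1) fit MA1 (runs is from the TSA package; the fitdf = 1 in the Ljung-Box call, subtracting the one estimated MA parameter, is an assumption about the exact call used):

runs(residuals(MA1))                                       # runs test, H0: independence
Box.test(residuals(MA1), lag = 4, type = "Ljung-Box", fitdf = 1)
acf(residuals(MA1))                                        # sample ACF of the residuals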
We have therefore been able to transform our training data into a series whose residuals show normality, constant variance, and independence. This is normally not an easy task, but with a data set of only 66 years it proved doable.
3 Model Validation
3.1.1 Confirmation of Models (Overfitting & Parameter Redundancy)
Now we confirm that the three suggested models are good models for our data set by extending the parameters of each. If the estimate of the additional parameter is not significantly different from zero, and the estimates for the original model do not change significantly from their original values, then we can confirm that the model is a good fit. We use a significance level of .05: if the ratio of an estimated coefficient to its standard error is less than the critical value of 1.96 in absolute value, we conclude that the coefficient is not significantly different from zero.
Model        ARI(1,1)            ARI(2,1)
φ1 (s.e.)    -0.4447 (0.1155)    -0.4970 (0.1298)
φ2 (s.e.)                        -0.1124 (0.1298)
AIC          125.67              126.93
φ2 check: -0.1124 / 0.1298 = -0.8659, |ratio| < 1.96, therefore not significant.

Model        IMA(1,1)            IMA(1,2)
θ1 (s.e.)    -0.5491 (0.1358)    -0.5527 (0.1352)
θ2 (s.e.)                        -0.0475 (0.1728)
AIC          124.04              125.97
θ2 check: -0.0475 / 0.1728 = -0.2748, |ratio| < 1.96, therefore not significant.

Model        ARIMA(1,1,1)        ARIMA(1,1,2)        ARIMA(2,1,1)
φ1 (s.e.)    0.3370 (0.1698)     -0.9009 (0.2298)    0.3189 (0.1461)
θ1 (s.e.)    -0.8658 (0.1027)    0.3474 (0.2712)     -0.9114 (0.0714)
θ2 (s.e.)                        -0.4464 (0.2057)
φ2 (s.e.)                                            0.2174 (0.1388)
AIC          124.97              127.71              124.52
θ2 check: -0.4464 / 0.2057 = -2.1701, |ratio| > 1.96, therefore significant.
φ2 check: 0.2174 / 0.1388 = 1.566, |ratio| < 1.96, therefore not significant.
If we continue to increase the order of the ARIMA model, the AIC value continues to grow. Generally, a smaller AIC value indicates a better model, so ARIMA(1,1,1) appears to be the best mixed-process model based on AIC values.
In addition, when examining our three suggested models with the overfitting procedure, ARI(1,1) and IMA(1,1) are confirmed to be good models for this time series. We were unable to confirm ARIMA(1,1,1) as a good model because ARIMA(1,1,2) has a significant coefficient for $\theta_2$, and the ARIMA estimates under overfitting were not close to the original estimates for this model. Therefore, once again, we cannot confirm this model as a good fit for our data. As we continue with forecasting, I will consider only the ARI(1,1) and IMA(1,1) models for my log data set.
3.1.2 Forecasting
The final step is to identify which of the remaining models is the better predictor of annual tornados. We forecast values for the testing data set; recall that we initially set aside the last 6 years of our data, so we can now test whether our models are accurate. As seen in Figure 11, we made our predictions in R, displayed as red dots. These predictions were made using the one-step-ahead forecasting procedure discussed in class, sketched below. As you can tell, it is not perfect. This is partly because the data set contains only 66 data points, and these 66 points may need to be modeled using a different technique; however, we are using the techniques demonstrated in class.
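A minimal sketch of the one-step-ahead procedure as I understand it (an assumption about the exact implementation): at each step the model is re-fit to all data up to the current year and used to predict the next year.

log.y <- log(y1)                     # full log series, 1950-2015
preds <- numeric(6)
for (i in 1:6) {
  fit <- arima(log.y[1:(59 + i)], order = c(0, 1, 1), method = 'ML')   # IMA(1,1)
  preds[i] <- predict(fit, n.ahead = 1)$pred                           # forecast year 60 + i
}
preds                                # one-step-ahead forecasts of the log counts, 2010-2015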
Figure 11: Log data Forecast
In order to determine which model has better predictive ability, we look at the MSE, MAP, and PMAD; the smaller the values, the better the predictions. From the table below, we determine that IMA(1,1) has the best predictive ability by this criterion.
[Figure 11 panels: forecasting with ARI(1,1) and forecasting with IMA(1,1), both on the log scale.]
        ARI(1,1)      IMA(1,1)
MSE     0.0434644     0.03533288
MAP     0.0450561     0.03748878
PMAD    0.04383421    0.03675122
(Smaller values indicate better predictions.)
As you can see, the values are not as small as we would like; however, with the data at hand, this seems to be reasonably good. All code used to perform the predictions is shown in Appendix A, and a sketch of the accuracy calculations appears below.
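A sketch of the three accuracy measures, assuming preds holds the one-step-ahead log-scale forecasts from above; MAP is read here as mean absolute percentage error and PMAD as the sum of absolute errors over the sum of absolute actuals, since the paper does not spell out the definitions:

actual <- log(y_test)
err    <- actual - preds
c(MSE  = mean(err^2),                          # mean squared error
  MAP  = mean(abs(err / actual)),              # mean absolute percentage error
  PMAD = sum(abs(err)) / sum(abs(actual)))     # proportion of mean absolute deviation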
Now that we have chosen IMA(1,1) as our best model, we transform the data back to its original scale. Figure 12 shows the predictions made on the log data, and Figure 13 shows the predictions after transforming back to the original scale. Both figures also contain the prediction intervals for the corresponding data sets; a sketch of the back-transformation appears below.
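A minimal sketch of the back-transformation, exponentiating the log-scale forecasts and their 95% prediction bounds (shown here with the multi-step predict for simplicity):

fc <- predict(MA1, n.ahead = 6)              # forecasts on the log scale
exp(fc$pred)                                 # point predictions, original scale
cbind(lower = exp(fc$pred - 1.96 * fc$se),   # 95% prediction interval,
      upper = exp(fc$pred + 1.96 * fc$se))   # original scale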
Figure 12: Log data Prediction Values & Interval
Figure 13: Original data Prediction Values & Interval
Figure 13 shows that three of our 6 values were predicted fairly close to the actual values. However, there is still a lot of variability in our model's predictions. Looking at the prediction intervals for the original data set after transforming back, we see that they are very large, which means the predictive capability of IMA(1,1) is not especially good. The complete list of 95% prediction intervals for the entire original data set is given in Appendix A. Figure 14 displays the final graph of our time series data, comparing the original data to the 6 predicted values.
Figure 14: Time Series Plot with Predictions (Original Data)
IMA(1,1): $Y_t = e_t + 0.5491\,e_{t-1}$
4 Discussion
The goal of this analysis was to use time series models to predict the future annual count of tornados in IL. We started with 2,406 tornado sightings in Illinois from 1950 to 2015, which we aggregated into yearly data, giving 66 data points. This is a fairly small data set on which to perform a time series analysis; keep this in mind as we discuss our results.
We began our analysis with a training data set containing the first 60 years, setting aside the last 6 years as a testing data set, which became extremely important when we performed our forecasts. Our first goal was to ensure that our data was stationary. To do this, we performed a log transformation and then differenced the data: the log transformation helped reduce the variability seen in the original data set, while the differencing removed the explosive behavior seen in the original time series plot. We confirmed stationarity of the transformed data set using the Dickey-Fuller test.
Once we had a stationary data set, we were able to begin the estimation process. We looked at
the ACF, PACF, EACF, and best subset selection chart in order to determine which models
would be best. We concluded that ARI(1,1), IMA(1,1), and ARIMA(1,1,1) would all be suitable models at that point. Next, we performed a residual analysis of all three models. As discussed in the paper, all three models were shown to have normal, independent residuals with constant error variance. This is primarily due to the small sample size of our data set; when a data set is small or extremely large, these three characteristics are much easier to achieve. However, when we performed overfitting on all three models, ARIMA(1,1,1) proved insufficient for this data set. Therefore, as we continued with the project, we focused only on ARI(1,1) and IMA(1,1).
Finally, we forecasted values for the testing data set using both ARI(1,1) and IMA(1,1) and calculated the MSE, MAP, and PMAD for each model. We found that IMA(1,1) had the smallest value for all three criteria, meaning that for our data set IMA(1,1) was the best model. Note, however, that the values of these criteria are not as small as we would have wished; the smaller the value, the better the predictions. As seen in the final time series plot in Figure 14, our predictions are far from perfect: in all cases the predictions overestimate the actual values, and in some cases, for example 2012, the overestimation is drastic.
To further improve our models, we may need to try time series models other than those discussed in class. In addition, to better predict tornados in Illinois, we might have broken the data down into quarters of the year, since tornados are clearly more frequent in the spring and summer months. With a different division of time, we would have had a larger number of data points from which to build different time series models. As for the data set we did use, the large variance over time in the tornado count could be due to the number of people actually out counting tornados in Illinois: in the early years of this data set, counts may be skewed downward because people were not tracking tornados as intensively as we do in 2015. In addition, the increase in the number of tornados over time could be due to global warming or other environmental effects.
In the end, this analysis shows that the model chosen to represent this data was relevant but could have been better. As stated before, the predictions were consistently overestimated. In the future, we would like to go back and test other potential models not discussed in this course in order to better predict the annual tornado count in Illinois.
Appendix
A Reference for Model Building
A.1 Training Data Transformation Codes
* All code used for this project is appended at the very end of this paper.
A.2 Model Selection
Dickey-Fuller Test on diff(log(t.data))
Model Estimations
A.2.1 Residual Analysis Codes
Normality Test:
H0: the data is normal
Ha: the data is not normal
Since the p-value in all three cases is > .05, we fail to reject H0.

Independence Test:
H0: the data is independent
Ha: the data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.

Ljung-Box Test:
H0: the data is independent, i.e., $r_1 = r_2 = \cdots = r_k = 0$
Ha: the data is not independent
Since the p-value in all three cases is > .05, we fail to reject H0.
Code:
### Illinois Tornado Annual Count Updated (Left out 6 data points)
library(TSA)
library(fUnitRoots)
data1<-read.csv(file="IL Total Data.csv",header=FALSE,sep=",")
x1<-data1[,2]
y1<-data1[,1]
y_train<-y1[1:60]
y_test<-y1[61:66]
t.data<-ts(y_train,freq=1,start=c(1950,1))
t.data1<-ts(y_test,freq=1,start=c(2010,1))
k.data<-ts(y1,freq=1,start=c(1950,1))
#Original Time Series Plot
plot(t.data,ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
plot(y=t.data,x=zlag(t.data),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # of IL Tornados vs Last Years # of IL Tornados')
#Log Transform
plot(log(t.data),ylab='log(Annual Tornados in IL)',xlab='Time',main='Time Series Plot of Annual Tornados in IL',type='o')
acf(log(t.data))
pacf(log(t.data))
eacf(log(t.data))
# EACF output for log(t.data):
# AR/MA
#   0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 0 x x o o o o o o o o o  o  o  o
# 1 x o o o o o o o o o o  o  o  o
# 2 o o o o o o o o o o o  o  o  o
# 3 x o o o o o o o o o o  o  o  o
# 4 o o o o o o o o o o o  o  o  o
# 5 x o o o o o o o o o o  o  o  o
# 6 x o o o o o o o o o o  o  o  o
# 7 o o o o o o o o o o o  o  o  o
#First Difference of the Log
plot(diff(log(t.data)),ylab='Annual Tornados in IL',xlab='Time',main='Time Series Plot of Annual Tornados in IL Diff(log(t.data))',type='o')
#Test for Stationarity (no constant, constant, constant + trend)
adfTest(diff(log(t.data)),lags=1,type=c("nc"))
adfTest(diff(log(t.data)),lags=1,type=c("c"))
adfTest(diff(log(t.data)),lags=1,type=c("ct"))
#Model Building
acf(diff(log(t.data))) #Suggests MA(1)
pacf(diff(log(t.data))) #Suggests AR(1)
eacf(diff(log(t.data))) # suggests AR(1)
# EACF output for diff(log(t.data)):
# AR/MA
#   0 1 2 3 4 5 6 7 8 9 10 11 12 13
# 0 x o o o o o o o o o o  o  o  o
# 1 o o o o o o o o o o o  o  o  o
# 2 x o o o o o o o o o o  o  o  o
# 3 x o o o o o o o o o o  o  o  o
# 4 x o o o o o o o o o o  o  o  o
# 5 x x o o o o o o o o o  o  o  o
# 6 x o o o o o o o o o o  o  o  o
# 7 o o x x x o o o o o o  o  o  o
#Best subset suggests MA(1) as best, then ARMA(1,1)
sub1<-armasubsets(diff(log(t.data)),nar=4,nma=4,y.name='test',ar.method='ols')
plot(sub1)
#Scatter Plot Comparison
par(mfrow = c(1, 3),pty = "s")
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data))),ylab='Tornado Count this Year',xlab='Previous Year Tornado Count',main='Scatterplot of # IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=2),ylab='Tornado Count this Year',xlab='2 Years ago Tornado Count',main='Scatterplot of # of IL Tornados')
abline(0,0)
plot(y=diff(log(t.data)),x=zlag(diff(log(t.data)),d=3),ylab='Tornado Count this Year',xlab='3 Years ago Tornado Count',main='Scatterplot of # of IL Tornados')
abline(0,0)
#Fitting Models via maximum likelihood
AR1<-arima(log(t.data), order = c(1, 1, 0), xreg = NULL, include.mean = TRUE, init = NULL, method = 'ML')
MA1<-arima(log(t.data), order = c(0, 1, 1), xreg = NULL, include.mean = TRUE, init = NULL, method = 'ML')
ARMA11<-arima(log(t.data), order = c(1, 1, 1), xreg = NULL, include.mean = TRUE, init = NULL, method = 'ML')
#No outliers were detected.
detectAO(ARMA11)
detectIO(ARMA11)
#Residual Analysis
tsdiag(AR1,gof=4,omit.initial=F)
tsdiag(MA1,gof=4,omit.initial=F)
tsdiag(ARMA11,gof=4,omit.initial=F)
#Normality
op <- par(mfrow = c(1, 3),pty = "s")
qqnorm(residuals(AR1),main='ARI(1,1) QQ Plot')
qqline(residuals(AR1),col='red')
qqnorm(residuals(MA1),main='IMA(1,1) QQ Plot')
qqline(residuals(MA1),col='red')
qqnorm(residuals(ARMA11),main='ARIMA(1,1,1) QQ Plot')
qqline(residuals(ARMA11),col='red')
References
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)