CLASSIFY DEADLY TORNADOS
Miranda Henderson & Katie Ruben
Mat 443 – Fall 2016
ABSTRACT
In order to analyze the occurrence of injuries or fatalities in this dataset, we will have to create a label. This label is
formed from two columns provided in the raw dataset: fatalities and injuries. We will create a label
column by assigning a value of one to tornado occurrences that caused death or injury and a zero otherwise. The
next step would be to clean the data. This dataset has no missing values; however, we will want to make sure that
the features provided will be beneficial to the classification problem. A crucial aspect of this investigation is to
identify features which have multicollinearity. Removing these features will allow these classification methods to
work properly. We will perform feature selection and come up with several sets of features to try on various
classification models. It will also be important to determine the existence of outliers and high-leverage points.
Several models that we will use are KNN, LDA, QDA, logistic regression, multiple linear regression, random forest,
SVM, and neural networks. In order to determine the best model, we will rely on the ROC plot for each model
as well as the AUC, sensitivity, accuracy, and MSE. It is important to have a high sensitivity rate. We
will also research other machine learning techniques and methods to help predict the number of fatalities or injuries
caused by tornadoes.
INTRODUCTION
Tornados are a significant aspect of life in certain areas of the country. They can cause death, injury, property
damage, and also high anxiety for those who choose to live in areas prone to tornados. Meteorologists are interested
in improving their understanding of the causes of tornados, as well as when they will occur and how severe they will be. Here,
we look at the number of annual tornados that have occurred in Illinois since 1950. The data used during this
simulation comes from the National Oceanic and Atmospheric Administration [2]. The original dataset contains
tornados from 1950 to 2015 for every state in the United States. We restrict the data to include only Illinois and its
boundary states; this condenses our new dataset to contain 9582 observations. In particular, the goal of this
investigation is to predict if a tornado will cause death and injury to humans or not based on characteristics of past
tornadic behavior.
As stated by the NOAA [1], “a tornado is a violently rotating column of air, suspended from a cumuliform cloud or
underneath a cumuliform cloud, and often visible as a funnel cloud.” In addition, for this violently rotating column
of air to be considered a tornado, the column must make contact with the ground. When forecasting tornados,
meteorologists look for four ingredients of such severe weather, which are present when the
“temperature and wind flow patterns in the atmosphere cause enough moisture, instability, lift, and wind shear for a
tornadic thunderstorm to occur [1].”
In order to have a better understanding of the data, it is important to note that the features being considered consist
of the following 18 descriptors: month of occurrence, F-scale rating, property loss amount, crop loss, start latitude
and longitude, end latitude and longitude, length in miles of the tornado, width in yards of the tornado, number of
states affected, whether the tornado stays in the state it started in, IL, and the 5 surrounding states (IA, IN, KY, MI, and
WI). Here, a tornado that has caused injury or death is represented as a positive classification, 1. The label for this
data set is called fatality and injury. This column was created through several data preparation steps in R, see
Appendix A. The data consist of an imbalanced label, where 8159 observations are classified as 0 and 1423
observations are classified as 1.
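The label construction is a single vectorized rule; the paper performs it in R (Appendix A), but the logic can be sketched in Python with toy values standing in for the NOAA fatality and injury columns:

```python
import numpy as np

# Toy stand-ins for the raw NOAA fatality and injury columns.
fatalities = np.array([0, 2, 0, 0, 1])
injuries   = np.array([0, 5, 0, 3, 0])

# Label a tornado 1 if it caused any death or injury, 0 otherwise.
label = ((fatalities + injuries) > 0).astype(int)
print(label)  # [0 1 0 1 1]
```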
APPROACH TO ANALYSIS
Currently, no analysis classifying human harm has been performed on this specific data set. With that in
mind, we implement several data mining methods to analyze this problem. In addition, we identify outliers and
high leverage points during the regression analysis. During our analysis, we find this data set does contain several
limitations.
The primary limitation is due to the unbalanced proportions of the classes. When considering the full data set, the
positive class only comprises 17% of all observations. This may cause our models to have a very low sensitivity
rate at the 5% cut-off value for classes. When performing data analysis on classification problems, it is desirable to
have a higher sensitivity rate. In the context of this specific problem, we would rather err on the side of telling
the public that an oncoming tornado will harm humans, so that they can take appropriate cover and safety
measures. We are less concerned about telling people to take cover when the oncoming tornado is less severe and
most likely will not harm anyone; this results in a lower specificity rate. Being incorrect in
predictions that lead people to take extra safety measures is not bad; however, being incorrect in predictions that
tell people they are not in harm's way could be devastating. In any classification problem, there will always be a
trade-off between sensitivity and specificity.
Another limitation to this data set is the amount of multicollinearity present between the predictors. When
performing any kind of classification model, it will be extremely important to remove the variables that are
multicollinear. The technique we will use to do this is subset selection using methods of forward, backward and best
subset feature selection, with the leaps package in R (see Appendix A). In addition, we will follow up with these
suggested subsets of predictors by using variance inflation factor to determine if high multicollinearity still exists.
Using this process, we devise 2 subsets of features on which to fit several models. Below is a
list of the techniques implemented in this analysis of classifying human harm from tornados.
DATA MINING METHODS IMPLEMENTED IN REPORT
• Regression (Appendix B)
o Multiple Regression
§ Ridge
§ Lasso
• Classification (Appendix C)
o General Linear Model
o LDA: Linear Discriminant Analysis
o QDA: Quadratic Discriminant Analysis
o KNN: K – Nearest Neighbors
• Tree Based Methods (Appendix D)
o Single Classification Tree
o Random Forest
o Boosting
§ Bernoulli
§ Adaboost
• Support Vector Machine (Appendix E)
o Kernel
§ Linear
§ Polynomial
§ Radial
§ Sigmoid
• Neural Network (Appendix F)
DATA PREPARATION
The first step of any analysis is to prepare the data. In preparing, any observations with missing values in any feature
column were identified and removed. In addition, the formatting of each column was appropriately set to
numeric, factor, character, or string in R. This data set has a column labeled state, which takes on the
abbreviations of the 6 states included in this analysis. In order to eliminate having a column of factors in the models,
we create an indicator matrix (Figure 1), which maps these factors to dummy variables for each
individual state. In doing so, this creates a data frame consisting of all numeric columns besides the label, which
is a factor of 0's and 1's.
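The paper builds the indicator matrix in R; as an illustrative sketch (Python here, with made-up observations), each state factor level becomes one dummy column:

```python
# Each observation's state becomes a row with a single 1 in the column
# of that state; the six levels are the states in this study.
states = ["IL", "IA", "IL", "WI", "KY"]          # toy observations
levels = ["IA", "IL", "IN", "KY", "MI", "WI"]

indicator = [[1 if s == lvl else 0 for lvl in levels] for s in states]
```

Every row sums to exactly 1, so the factor column can be dropped in favor of these numeric dummies.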
Figure 1: Indicator Matrix of States (First 5 Rows Shown)
No further preparation was required for this data set. The resulting structure of the data can be seen in Appendix A.
We now consider multicollinearity by performing feature selection to determine appropriate feature subsets to be
considered for our models.
Additionally, the condensed dataset that will be used for this analysis consists of 9582 observations. The full data set
is split randomly using R in order to create our training and testing data to use during the creation and testing of our
models. The training and testing data sets contain 4791 observations each.
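The 50/50 split can be sketched as a random permutation of row indices (a Python sketch of what the R split does; the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(443)   # arbitrary seed for reproducibility
n = 9582                           # observations in the condensed dataset
idx = rng.permutation(n)

train_idx, test_idx = idx[: n // 2], idx[n // 2:]
# Both halves contain 4791 observations, matching the split above.
```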
MULTICOLLINEARITY AND FEATURE SELECTION
The first step taken was to observe the correlation plot of the variables, which can be seen in Figure 2. From
this matrix, we identify strong positive and negative correlations by the linearity, shape, and color of each panel.
The strongest positive correlations are between the F-scale rating and the length and width of the tornado's path.
The strongest negative correlation is between ending latitude and longitude.
The next step taken was to determine a subset of features that were not multicollinear. To do so, forward,
backward, and best subset feature selection on all features were performed. These results can be seen in
Appendix A. Many subsets could have been chosen, but we only consider two sets of features to test with the
models. Subset 0 is the full model with all features; we consider this subset when we perform ridge and lasso
regression because lasso performs its own feature selection. Subset 1 contains 9 features and Subset 2
contains 6 features, as seen in Figure 3. Additionally, we see that the multicollinearity appears
acceptable only for Subsets 1 and 2, since VIF < 5 for all of their features.
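The VIF check can be written out directly: each feature is regressed on the remaining features, and VIF_j = 1/(1 - R_j^2). A numpy sketch on synthetic data (the actual check used VIF in R on the NOAA features):

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: regress it on the rest."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a, b = rng.normal(size=200), rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])  # col 2 ~ col 0
v = vif(X)   # columns 0 and 2 fail the VIF < 5 rule; column 1 passes
```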
Figure 2: Correlation Matrix of Features
Figure 3: Variance Inflation Factor (m0=full, m1=6 features, m2=9 features)
Subset.0=m0, Subset.1=m1, Subset.2=m2
METHODS OF ANALYSIS
We begin our model analysis with regression techniques, specifically OLS, ridge, and lasso regression. We follow
up with four classification approaches: logistic regression, linear discriminant analysis, quadratic discriminant
analysis, and the K-nearest neighbors approach. In order to ensure there is no multicollinearity, we restrict ourselves to
Subsets 1 and 2 identified above. We then implement random forest, support vector machine, and neural network
techniques. In each subsection, we summarize our findings for each technique. All corresponding code can be found
in Appendices B, C, D, E, and F accordingly. For each method that requires tuning, 10-fold cross validation was
performed. In addition, an R loop was created for each method to determine the cut-off value for prediction
probabilities by weighing the significance of sensitivity versus accuracy. The results follow below.
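The cut-off loop amounts to sweeping candidate thresholds and recording sensitivity and accuracy at each; a Python sketch with toy probabilities (the paper's loop is in R):

```python
import numpy as np

def scan_cutoffs(p, y, cutoffs):
    """Classify p >= c as class 1 and report (cutoff, sensitivity, accuracy)."""
    rows = []
    for c in cutoffs:
        pred = (p >= c).astype(int)
        sens = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
        acc = np.mean(pred == y)
        rows.append((c, sens, acc))
    return rows

p = np.array([0.05, 0.10, 0.30, 0.70, 0.15, 0.02])  # toy probabilities
y = np.array([0, 1, 1, 1, 0, 0])                    # toy labels
rows = scan_cutoffs(p, y, [0.05, 0.12, 0.50])
# A low cut-off buys sensitivity at the cost of accuracy (specificity).
```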
REGRESSION
Ordinary Least Squares
We begin with multiple linear regression using ordinary least squares (OLS) estimates. These estimates
produce a model that has low bias and high variance. In R, we created a linear model and sent an observation to
class 1 when its fitted probability of the positive class exceeded the 22% cut-off. In addition,
we looked for observations that may be outliers or have high leverage; these results can be seen in Figure 4.
Although identified, we choose to leave these observations in our model because they are realistic results of
tornados that have happened. The confusion matrix for this model can also be seen in Figure 4.
Ridge
The next method implemented is known as ridge regression, which is similar to OLS regression except now
we are introducing a penalty term which is known as the “L2” norm. This penalty term is used to help control
the amount of variance in the model. Therefore, ridge regression introduces some bias with the reduction in
variance. The interpretability of this model is similar to that of the OLS method.
Ridge regression was performed with all 3 subsets of the data. We normalized our predictors before running
these subsets. In order to determine the best lambda value for the penalty term in ridge regression, we use
5-fold cross validation (see Appendix B). Each model has a 12% cut-off, which provided us with the best
sensitivity rate and also a decent accuracy rate. A comparison summary of these three subsets can be seen in
Figure 5. Of these models, ridge regression on Subset 1 performed the best, with the highest AUC, accuracy,
and sensitivity rate. See Figure 7 for AUC values.
Figure 4: OLS Model Outliers/Leverage Points, Cut-Off Rate Determination, ROC/AUC,
Confusion Matrix
Figure 5: Ridge Model Confusion Matrix, Example of Cut-Off boundary
determination.
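The L2 penalty's shrinkage effect can be seen in the closed-form ridge estimate; a numpy sketch on synthetic data, standardizing predictors first as above (the paper's fits used R with cross-validated lambda):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate on standardized predictors (L2 penalty)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # normalize predictors
    p = Z.shape[1]
    # beta = (Z'Z + lam*I)^(-1) Z'y, with the intercept handled by centering y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ (y - y.mean()))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
b_small = ridge_fit(X, y, lam=0.01)    # nearly OLS
b_big = ridge_fit(X, y, lam=1000.0)    # heavy shrinkage toward zero
```

Raising lambda trades variance for bias by shrinking every coefficient toward zero.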
Lasso
The last form of a regression model is lasso regression; this model also includes a penalty term. However,
now it is an “L1” penalty which allows for the model to perform feature selection. Again, in comparison to
the OLS method, lasso will introduce bias and decrease variance. Since this method produces a sparse model,
it is an easier model to interpret in relation to ridge and multiple regression.
Again, we ran 5-fold cross validation to choose lambda, the penalty coefficient. We also normalized each
subset's features before running our models. We ran this model on all three subsets and performed the same
analysis as above to determine the best cut-off percentage for classifying the predictions. Similarly, lasso has
the optimal sensitivity and accuracy rate at a 12% cut-off. As seen in Figure 6, the lasso method performed
feature selection only on Subset 0. Since it did not perform feature selection on the other 2 subsets, we
expect this method to perform worse than or equal to ridge, because the benefit of lasso is its feature
selection. As expected, of the three models shown in Figure 6, Subset 1 performed the best in relation to
AUC, accuracy, and sensitivity. See Figure 7 for AUC values.
Regression Discussion
As seen in the ROC plot in Figure 7, the OLS method performed the worst. However, all of the other methods in
relation to AUC as the criterion performed similarly. For this analysis, we are most interested in having a high
sensitivity rate. The model with the best sensitivity rate, accuracy rate and AUC value is ridge with Subset 1 (Figure
7). This model is declared to be the best regression model during this analysis. This confirms that the feature
selection process of the lasso was not beneficial for this data. All R code corresponding to this regression analysis is
seen in Appendix B.
CLASSIFICATION
All classification methods will be performed on Subsets 1 and 2 only. These models work best when
multicollinearity has been removed. All cut-off boundaries for classes were found through the same looping
structure discussed for regression, comparing the trade-off between accuracy and sensitivity.
Figure 6: Lasso Model Confusion Matrix, Example of Cut-Off boundary determination.
Figure 7: Comparison of Regression Models
Logistic Regression
The first designated classification model used is logistic regression. This model is less flexible than OLS,
ridge, and lasso, which means it will have lower variance and higher bias. The coefficients of this model are
estimated by maximum likelihood. This model has the potential to perform well because it makes no
assumptions about the distribution of the data and is also robust to outliers. One thing to note is that logistic
regression works best when we have linear boundaries between classes.
Observing the two models in Figure 8, Subset 1 has the better AUC value and sensitivity rate. The accuracy
rate is fairly similar between the models. The class cut-off boundary for both models was 12%. This value was
discovered by using a loop in R to determine an appropriate trade-off between accuracy and sensitivity.
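The maximum-likelihood fit behind logistic regression can be sketched as gradient ascent on the log-likelihood; a self-contained numpy toy on a linearly separable problem (the paper's models use R's glm):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Logistic regression via gradient ascent on the log-likelihood."""
    Z = np.column_stack([np.ones(len(X)), X])   # prepend an intercept
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Z @ w))        # P(class = 1)
        w += lr * Z.T @ (y - p) / len(y)        # likelihood gradient step
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # a linear class boundary
w = fit_logistic(X, y)
probs = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(300), X]) @ w)))
acc = np.mean((probs >= 0.5) == y)              # near-perfect on linear data
```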
Linear Discriminant Analysis
Linear discriminant analysis (LDA) is another classification method with linear boundaries. This method
assumes the same covariance matrix for all classes. In addition, it is a less flexible method, resulting in lower
variance in a trade-off for higher bias. The coefficients of this model are estimated assuming a Gaussian
distribution. One limitation of LDA is that it is not robust to outliers.
When comparing the two LDA models in Figure 9, it is seen that Subset 1 performs better with respect to
AUC and accuracy. It has a 2% lower sensitivity rate, but this is not drastic. The class cut-off boundaries used
for these models were 9% and 8%, respectively, for Subsets 1 and 2.
Quadratic Discriminant Analysis
Quadratic discriminant analysis (QDA) is a method which introduces non-linear boundaries between classes
in a classification problem. QDA is more flexible than LDA; hence QDA has higher variance
and lower bias than LDA. Unlike LDA, QDA assumes a different covariance matrix for each class, but the
coefficients are still estimated assuming a Gaussian distribution.
When comparing the two QDA models in Figure 10, it is seen that Subset 1, with a cut-off rate of 5%, has an
extremely low sensitivity rate, which is not good. The model for Subset 2, with a cut-off rate of 10%, has more
acceptable values for AUC, accuracy, and sensitivity than the model for Subset 1 (refer to Figure 10). From
this comparison, having more features was not beneficial for model building using this method.
Figure 8: Logistic Regression
Figure 9: Linear Discriminant Analysis
K-Nearest Neighbors
The final classification method used is K-nearest neighbors (KNN). In this method, we are interested in
demonstrating the effect of normalizing the features versus not. As seen in Figure 11, the normalized data
performed better. In addition, the normalized data for both subsets produced the same results for the
confusion matrices and ROC plots. KNN is a non-parametric method that does not rely on any prior
assumptions; it has non-linear boundaries and is more flexible than the other methods. In order to determine
the value of K for the number of nearest neighbors, we performed 10-fold CV using the caret package in R.
Figure 11 shows that all of these models perform poorly with respect to accuracy and AUC. Each of these
models has a 6% cut-off value to determine the class of the prediction.
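The normalization effect is easy to reproduce: when one feature sits on a much larger scale, it dominates the Euclidean distance and drowns out the signal. A numpy sketch of 5-NN with and without scaling, on toy data (the paper used caret in R):

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k):
    """Majority-vote k-nearest neighbors with Euclidean distance."""
    preds = []
    for x in Xte:
        nearest = ytr[np.argsort(np.linalg.norm(Xtr - x, axis=1))[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

rng = np.random.default_rng(3)
# Feature 0 carries the class signal; feature 1 is noise on a 1000x scale.
X = np.column_stack([rng.normal(size=400), 1000 * rng.normal(size=400)])
y = (X[:, 0] > 0).astype(int)
Xtr, Xte, ytr, yte = X[:300], X[300:], y[:300], y[300:]

acc_raw = np.mean(knn_predict(Xtr, ytr, Xte, k=5) == yte)
Z = (X - X.mean(axis=0)) / X.std(axis=0)        # normalize both features
acc_norm = np.mean(knn_predict(Z[:300], ytr, Z[300:], k=5) == yte)
# acc_norm recovers the signal; acc_raw hovers near chance.
```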
Classification Discussion
Based on the four models discussed above, it appears that this data has linear class boundaries. QDA and KNN
performed the worst of the considered methods for each subset. The worst model is QDA for Subset 1, which
shows a 56% sensitivity rate even after considering the trade-off between accuracy and sensitivity as
described earlier. As demonstrated in Figure 12, both subsets perform similarly for each respective model
when comparing AUC, accuracy, and MSE. In addition, the linear decision boundary models have performed
best. Of these 10 models, the models to consider using would be GLM or LDA on Subset 1 or 2. In the interest
of model interpretation, it is always best to choose the model with fewer features; thus, Subset 2 would be a
good choice. All R code associated with the classification methods can be seen in Appendix C.
Figure 11: KNN applied to normalized and non-normalized
Subset 1 and Subset 2.
Figure 12: Classification Method Comparison
Figure 10: Quadratic Discriminant Analysis
TREE BASED MODELS
All tree-based models will be performed using the full set of predictors available, Subset 0. We will investigate
the use of a single tree, random forest, and boosting methods. All corresponding R code is seen in Appendix D.
Single Classification Tree Method
The advantage of using a tree is that it is easy to explain to people. In addition, since trees can be displayed
graphically, almost anyone can understand their prediction process. Trees are also able to handle qualitative
predictors, unlike other types of modeling. The one downfall of a single classification tree is that its
predictions are not as accurate as those of the other methods discussed previously in this paper. A single tree
makes it extremely difficult to capture all of the information that is caught by the other methods.
Through this method, we construct a single unpruned tree and then investigate further by pruning it back.
The unpruned tree can be seen in Figure 13 with the corresponding confusion matrix. Pruning the tree will
help make the model more interpretable; in addition, it should help the predictive capabilities of our model.
The pruned tree can be seen in Figure 14. We used 10-fold cross validation to determine that our pruned tree
should have 4 terminal nodes. As can be seen, pruning the tree gave us the same confusion matrix as the
unpruned tree. In all, pruning has allowed for a more interpretable model without a loss of information.
Random Forest
The next model we implemented was random forest. This method is beneficial for several reasons, including its
high predictive accuracy in real-world examples, its ability to handle a large number of predictors, its ability to
identify feature importance, and its ability to impute missing data while maintaining accuracy. In addition, a
random forest generates an internal unbiased estimate of the generalization error as the forest is being built. An
important feature of random forests is the use of out-of-bag samples, which is close to being identical to N-fold
cross validation. Another benefit of random forest is that we can attain nonlinear estimators by implementing
cross validation. When using random forest, there are several tuning parameters we wanted to consider. We
looked at finding the optimal number of features to try at each split with "mtry." We also looked at finding the
optimal number of trees to create for our forest using "ntree." A loop in R was used to determine these values,
as well as 10-fold cross validation to determine the value of "mtry". In R, we used the command "tune.rf" from
the randomForest library to perform 10-fold cross validation. Our results can be
Figure 13: Unpruned Tree on Left. Optimal Number of Nodes for Pruning is 4 on Right.
Figure 14: Pruned Tree on the Left. Confusion Matrix in the Center. ROC plot of Pruned Tree on the
Right.
Figure 15: Settings for Mtry and Ntree based on loop in R and result from 10-fold CV using tune.rf
on Right.
seen below in Figure 15. We have performed bagging on our random forest when we consider both tuning
parameters "mtry" and "ntree."
Once we found that the optimal number for "mtry" was 7 features and for "ntree" was 4,500 trees, we ran our
predictions in R. In doing so, we attained the feature importance shown to the right, along with the
corresponding confusion matrix and ROC plot in Figure 16. The cut-off value used to assign the predicted
probabilities to 1 or 0 was found in the same manner as discussed during the classification and regression
modeling. For this method, the cut-off value was 12%. Referring to the variable importance plot, we see that
the F-scale rating, property loss, length, starting and ending latitudes and longitudes, width, and month are
all of significant importance to this model. Therefore, random forest has selected only 9 features for this
model. This selection was made using the Gini index.
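The Gini criterion behind that variable ranking is simple to state: a split is scored by how much it reduces node impurity, and a feature's importance accumulates these decreases over the forest. A small numpy sketch of scoring one split:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(y_parent, y_left, y_right):
    """Impurity decrease of a candidate split, weighted by child sizes."""
    n = len(y_parent)
    return (gini(y_parent)
            - len(y_left) / n * gini(y_left)
            - len(y_right) / n * gini(y_right))

y = np.array([0, 0, 0, 1, 1, 1])
pure = gini_decrease(y, y[:3], y[3:])       # perfectly separates the classes
mixed = gini_decrease(y, y[::2], y[1::2])   # leaves both children mixed
```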
Boosted Random Forest: Bernoulli Distribution
Here, we implement boosting methods using the Bernoulli distribution. Boosting methods grow trees
sequentially, using information from previous trees to grow each subsequent tree. This method essentially
learns as it proceeds instead of fitting one large decision tree. This slow learning can improve performance.
Here, we have chosen to create a model with a Bernoulli distribution and one with the AdaBoost
distribution. Since we have a binary classification problem, we first choose the Bernoulli distribution.
We choose to grow 5000 trees, with a shrinkage value of 0.01 and a depth of 4. The results from this model can
be seen in Figure 17. From the variable importance plot, the F-scale rating and property loss had the largest
importance in the model. Additionally, we see that 18 predictors were influential and would be used for
building this boosting model with a Bernoulli distribution. The summary confusion matrix for the predictions
of this model can also be seen in Figure 17. We see from the confusion matrix statistics and ROC plot that this
model yielded good results.
Boosted Random Forest: AdaBoost Distribution
In addition to the boosting model in the previous section, we choose to build a boosting model using the
AdaBoost distribution. This method produces a sequence of weak classifiers and final predictions are created
by a weighted majority vote of the classifiers.
Once again, we choose 5000 trees, a shrinkage value of 0.01, and depth of 4, like for the Bernoulli
distribution. The results can be seen below. From the variable importance plot, we see that this model also
Figure 17: Bernoulli Random Forest Statistics
Figure 16: Random Forest Statistics
uses 18 predictor variables to construct
the model. These results can be seen in
Figure 18. The confusion matrix statistics
and ROC plot can also be seen, and once
again we see that the performance of this
model was also quite successful.
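The "weighted majority vote" can be made concrete with one round of AdaBoost's reweighting: a weak classifier's vote weight alpha grows as its weighted error shrinks, and the points it misclassified get heavier weights for the next round. A numpy sketch with toy labels:

```python
import numpy as np

y    = np.array([ 1,  1, -1, -1,  1])   # true labels in {-1, +1}
pred = np.array([ 1, -1, -1, -1,  1])   # one weak classifier's predictions
w = np.full(len(y), 1 / len(y))         # uniform starting weights

err = np.sum(w[pred != y])              # weighted error (here 0.2)
alpha = 0.5 * np.log((1 - err) / err)   # this classifier's vote weight
w = w * np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight hits
w = w / w.sum()                         # renormalize to a distribution
```

The single mistake (index 1) ends up carrying half of the total weight, which is what forces the next weak classifier to focus on it.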
Tree-Based Model Discussion
Here, we have tried several tree-based models for predictions. As we have seen, the relative performance of
these models was good, producing high accuracy as well as large sensitivity values, as seen in Figure 19. We see
that random forests are more beneficial than bagged trees because they decorrelate the trees: random forests only
consider m features at each split, which aids the decorrelation. Additionally, the boosting methods were also
successful. Boosting using the Bernoulli distribution produced the highest AUC value, indicating very good
modelling success. Of the tree-based methods, boosting with the AdaBoost distribution produced the highest
accuracy, while random forests produced the largest sensitivity.
SUPPORT VECTOR MACHINE
Support vector machines are more complex than decision trees, which can sometimes be beneficial for
classification problems. These algorithms look for a decision boundary that separates members of the classes.
If the hyperplanes that separate the classes are not linear, then support vector machines can map the training
data into a higher dimension by a nonlinear mapping. Here, we use different kernels for support vector
machines to determine their performance accuracy in this setting. In order to tune each model, we split our
original training data into a smaller subset of 638 observations; due to the size of the data set, tuning on our
full training set was not computationally feasible on our laptops. Tuning was done using the package e1071
with 10-fold cross validation. Once an appropriate tuning parameter was discovered for each kernel, we
returned our training and testing data to their original sizes. Corresponding code for all support vector models
is seen in Appendix E.
Linear Kernel: Support Vector Classifier
We first fit an SVM model with a linear kernel, tuning
the parameters to determine an appropriate
value for the cost. We find that when we use a value of
10 for cost, we get the smallest error. We create our
linear kernel support vector classifier and determine an
appropriate cut-off value for classification. From the
sensitivity-accuracy-AUC trade-off, the best cut-off
value is at 15%. Using this cut-off value, we perform
our predictions and compute the confusion matrix. The
confusion matrix and corresponding statistics can be
seen in Figure 20. We see that this model performed
fairly well in regards to sensitivity, accuracy, and
AUC. However, this model didn’t produce the best
results in comparison to other models.
Figure 19: Tree Statistics
Figure 20: Linear SVM
Figure 18: AdaBoost Random Forest Statistics
Polynomial Kernel:
We now see if a polynomial fit is more appropriate for our data. Once again, we tune the parameters cost,
degree, and gamma to determine which will suffice for the model. We find that the error is minimized when
cost is 1, degree is 3, and gamma is 0.1. Using these parameter values, we create an SVM fit of degree 3. The
results from this fitted model are shown in the confusion matrix in Figure 21. We again choose a cut-off value
of 0.15 for our predictions. This model has the highest sensitivity rate we have seen thus far. However, it also
has the lowest accuracy and specificity rates we have seen through our analysis. This model predicts that
almost every tornado to ever occur will be deadly.
Radial Kernel:
We extend our SVM models to a radial kernel. Once again, we tune the parameters in order to minimize the
error. We determine that when cost is 10 and gamma is 0.1, we minimize the error. Similar to the previous
SVM models, we find a cut-off value of 0.15 for this model. The model performance can be seen in Figure
22. We can see from the confusion matrix that this model has good specificity and bad sensitivity. The
accuracy rate seems to be okay.
SVM Discussion
In this section, we have created several SVM models with different kernels. In each case, we tuned the
parameters in order to determine the most appropriate values that would minimize the errors. The ROC plots
for each of these kernels can be seen in Figure 24. We see that the linear SVM performed the best out of all
three of the kernels tried. Thus, as seen previously, our data works well with linear assumptions.
Additionally, we see from the summary table in Figure 25 that a polynomial assumption produced a model
with terrible accuracy in addition to a low AUC value. Considering the cut-off values chosen, it appears that
a linear SVM produced the best results for this class of models.
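The four kernels tried above differ only in how they score similarity between two observations; written out directly (the gamma, degree, and offset values here are illustrative, not the tuned ones):

```python
import numpy as np

def linear_k(x, z):                       return x @ z
def poly_k(x, z, gamma=0.1, d=3, c0=1.0): return (gamma * (x @ z) + c0) ** d
def radial_k(x, z, gamma=0.1):            return np.exp(-gamma * np.sum((x - z) ** 2))
def sigmoid_k(x, z, gamma=0.1, c0=0.0):   return np.tanh(gamma * (x @ z) + c0)

x, z = np.array([1.0, 2.0]), np.array([2.0, 0.5])
vals = [linear_k(x, z), poly_k(x, z), radial_k(x, z), sigmoid_k(x, z)]
# The radial kernel of a point with itself is always 1 (zero distance).
```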
NEURAL NETWORKS
The final type of model used in this analysis was a neural network. Neural networks are created from an iterative
learning algorithm that is composed of nodes. The input nodes are connected to weighted synapses; a signal flows
through the neurons and synapses in order to predict the outcome. Neural networks are able to adapt by changing
structure based on the predictive abilities of the training data being passed through the system. This model uses
calculated weights as input variables for which these weights are equivalent to the regression parameters of GLM.
Figure 21: Polynomial SVM Figure 22: Radial SVM
Figure 24: ROC SVM
Figure 25: SVM Model Statistics
All weights are initialized from the standard normal distribution. We used the package "nnet", which allows for only 1 hidden
layer. In the future, we may wish to use another package so that we can have multiple hidden layers. Below are the
results found.
Using the caret package to perform 10-fold cross validation on our training set, we were able to tune the decay
parameter and the size of the hidden layer. The best results occurred when decay was 0.01 with a single hidden
unit; see Figure 26. An image of our neural network can be seen in Appendix F along with all corresponding
code. We then produced a model using these parameter settings and created predictions from it. The cut-off
value for classifying our predictions was set to 0.88. The ROC with AUC, confusion matrix, and cut-off plot can
be seen in Figure 27.
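The forward pass of the one-hidden-layer network that nnet fits can be sketched in a few lines of numpy (sizes and values here are illustrative; the fitted weights come from training, not from this random draw):

```python
import numpy as np

rng = np.random.default_rng(4)

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer network: input -> sigmoid hidden layer -> sigmoid output."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))     # hidden-layer activations
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))  # output: P(harmful tornado)

n_in, n_hidden = 6, 1                            # illustrative layer sizes
W1 = rng.standard_normal((n_hidden, n_in))       # weights drawn from N(0, 1)
b1 = rng.standard_normal(n_hidden)
W2 = rng.standard_normal((1, n_hidden))
b2 = rng.standard_normal(1)

p = forward(rng.standard_normal(n_in), W1, b1, W2, b2)
pred = int(p[0] >= 0.88)                         # threshold at the chosen cut-off
```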
RESULTS
Here, we have used multiple methods in order to construct potential models that can be used for predicting whether
or not a tornado can be harmful based on several features. We attempt to summarize the results from each of the
models that we have constructed in order to compare them and determine which model performed the best.
During this analysis, we looked at many different models in order to determine which would perform best at predicting harmful tornadoes from our data. For a model to be classified as successful, it should have high accuracy and sensitivity rates as well as a large AUC value. A comparative chart of our results for each of the constructed models can be seen in Figure 28.

This table organizes the results from highest AUC to lowest. As we can see, the random forest methods performed the best in terms of AUC. Additionally, these methods produced relatively high accuracy and sensitivity. From this table, we conclude that random forest methods are probably the best for the data used here.

In addition, the table shows that the next group of highest-performing models in terms of AUC are all linear methods: Ridge, GLM, Lasso, and LDA. Thus, we can conclude that either regression or a linear boundary is sufficient for classification on our data.
Figure 26: Neural Network Tuning
Figure 27: Neural Network Statistics
Figure 28: Model Statistic Comparison
When comparing AUC values for all of the models considered in this paper, we see that the more complex methods, such as neural networks and SVMs, produced decent results in terms of AUC, accuracy, and sensitivity. However, these more complex methods did not perform as well as random forests in general across all categories.
The K-nearest neighbor classifiers performed the worst for this classification problem, producing extremely low AUC values; that is, random guessing would perform better than these methods. Additionally, these methods produced extremely low specificity rates.
The details of all 25 models we constructed and their predictive performance are summarized in Figure 28. We have analyzed these results and determined that a high AUC is the best indicator of a better model for these classification problems. In addition to a high AUC, a model must also have sufficient accuracy and sensitivity rates to be considered good.
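The AUC values and ROC curves compared throughout this section can be obtained in R with the pROC package. The following is a minimal sketch; the vectors `probs` (predicted probabilities) and `y` (true 0/1 labels) are placeholder names, not objects from our code.

```r
# Sketch: ROC curve and AUC for one model's predictions.
library(pROC)

roc_obj <- roc(y, probs)  # response first, predictor second
auc(roc_obj)              # area under the ROC curve
plot(roc_obj)             # ROC plot used to compare models visually
```

Repeating this for each model's predicted probabilities yields the AUC column of Figure 28.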
DISCUSSION
In this paper, we have used several classification methods in order to predict whether tornadoes will cause injury or fatalities (positive classification). The details and advantages of each chosen method have been outlined within their sections. A summary of the results from all models can be found in Figure 28.
The data used throughout this investigation includes 19 features; subsets of these features were used in the methods that do not perform feature selection themselves. In order to construct the models used in this project, we used training data to develop the models, which were later tested for predictive ability on test data. A model is deemed sufficient when it produces a large AUC value and high accuracy and sensitivity. Due to the imbalanced nature of the data, we chose to put more importance on the sensitivity rate than on the accuracy rate.
Using a loop in R, we constructed an accuracy-sensitivity trade-off plot in order to determine an appropriate cut-off value for classification from the predicted probabilities. Using these cut-off values, we could determine the positive classifications and construct the ROC plots and confusion matrices.
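The trade-off loop described above can be sketched as follows. This is an illustration under assumed names: `probs` holds predicted probabilities and `y` the true 0/1 labels.

```r
# Sketch: accuracy and sensitivity at each candidate cut-off value.
cutoffs <- seq(0.01, 0.99, by = 0.01)
acc  <- numeric(length(cutoffs))
sens <- numeric(length(cutoffs))

for (i in seq_along(cutoffs)) {
  pred    <- as.numeric(probs >= cutoffs[i])          # classify at this cut-off
  acc[i]  <- mean(pred == y)                          # overall accuracy
  sens[i] <- sum(pred == 1 & y == 1) / sum(y == 1)    # true positive rate
}

plot(cutoffs, acc, type = "l", ylim = c(0, 1),
     xlab = "Cut-off", ylab = "Rate")
lines(cutoffs, sens, lty = 2)
legend("bottomleft", c("Accuracy", "Sensitivity"), lty = 1:2)
```

The cut-off is then read off the plot where the two curves strike an acceptable balance, with extra weight on sensitivity for this imbalanced data.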
In this investigation, we used regression methods, classification methods, tree methods, support vector machines, and neural networks in order to predict classes for our data. Regression methods involved preliminary feature selection, and subsets of features were tested for each regression method. From Figure 28 in the Results section, we see that these regression methods were sufficient for prediction: they usually produced a high AUC value and decent sensitivity and accuracy rates.
The subsets determined from the feature selection used for regression methods were also used for classification methods. The classification methods considered in this paper were logistic regression, LDA, QDA, and KNN classification. From our Results section, we can see that the classification methods using linear boundaries were better in terms of AUC and sensitivity. Thus, we conclude that methods using nonlinear boundaries are not sufficient for this dataset. Additionally, the KNN classifiers performed the worst of all the models constructed in this investigation in terms of AUC.
From the results in the previous section, we determined that random forests were the best methods in this investigation. We performed random forests with and without boosting, and all three models produced the top AUC values. Additionally, each model yielded high accuracy and sensitivity. By our standards for a good model, the boosted random forest using a Bernoulli distribution was the best. Perhaps the only downfall of the random forests was the computation time on this large dataset; however, the predictive ability of these models was very good.
In addition to all of these methods, we used more complex methods such as SVMs and neural networks. Using the caret package in R, we were able to tune the parameters in order to produce the best model for each of these methods. We found that these models produced good AUC values, and their sensitivity and accuracy rates typically fell around 75%, which we classify as successful for predicting on this unbalanced data in comparison to other models. Although these SVM and neural network models had good predictive ability, in comparison to simpler regression or classification methods it may not be worth the computational cost to use such complex models.
Throughout this paper, we have investigated many models for the classification of harmful tornadoes. We have concluded that random forests worked best for predicting the positive class. Additionally, we have determined that some of the less computationally expensive methods were adequate for good predictions on this data. In the future, an investigation more adapted to the unbalanced data should be considered in order to improve the classification results seen in this paper. Other improvements for future work on this dataset would be to transform variables, consider interaction terms, create a ZIP-code feature from the latitude and longitude provided in the original dataset, and perform under-sampling on the predominant class. Time permitting, we would run more models with a further investigation of these suggested improvements.
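The under-sampling suggested above could be sketched in R as follows. This is a hypothetical illustration, assuming a data frame `dat` with a binary label column `harmful`; these names are placeholders.

```r
# Sketch: under-sample the predominant (non-harmful) class so that
# both classes appear in equal numbers before model fitting.
set.seed(1)
pos <- dat[dat$harmful == 1, ]                   # minority class (harmful)
neg <- dat[dat$harmful == 0, ]                   # majority class
neg_sub <- neg[sample(nrow(neg), nrow(pos)), ]   # draw a matching-size subset
balanced <- rbind(pos, neg_sub)                  # balanced training data
```

Fitting the models on `balanced` rather than `dat` would directly address the class imbalance that motivated our emphasis on sensitivity.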
BIBLIOGRAPHY
[1] Edwards, R. (n.d.). The Online Tornado FAQ. Retrieved March 29, 2016, from
http://www.spc.noaa.gov/faq/tornado/
[2] Storm Prediction Center WCM Page. (n.d.). Retrieved March 29, 2016, from
http://www.spc.noaa.gov/wcm/#data, Severe Weather Database Files (1950-2015)
[3] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With
applications in R. New York: Springer.
INDIVIDUAL CONTRIBUTION TO PROJECT
Both authors worked on this project equally and sought advice from each other throughout the entire process.
Katie
• Found data set and performed initial testing
• Prepared data in R to be used for various data mining methods
• Wrote R code
• Compiled results into paper
Miranda
• Wrote R code
• Compiled results into paper
• Created presentation