• Performed data cleaning and analysis in R and SAS to predict the financial loss caused by storms and to predict when a storm will occur based on historical storm data
• Implemented algorithms such as logistic regression, multiple regression, linear discriminant analysis, and PCA to obtain insights from the storm dataset (1950-2007)
Overview:
• Tropical cyclones, storms, and tornadoes cause enormous human and property losses each year
• An estimated 1.9 million people have perished in cyclones over the last two centuries
• The United States is one of the worst-affected countries in terms of property loss due to cyclones
• In addition to the human toll, the United States has suffered property losses in excess of $10 billion in each of the last 8 years
• Can we predict the loss caused by cyclones from past data, and thereby provide relevant insights that help disaster management efforts actually reduce that loss?
Project Summary
• Data Source: The dataset contains information on tornadoes from 1950 to 2015.
• The dataset was created by the National Weather Service and is available at http://www.spc.noaa.gov/gis/svrgis/
• Project Objective: We plan to analyze the storm data and provide insights that can help disaster management teams better channel their resources for future cyclones
• The analysis will include a state-wise breakdown of the worst-affected states
• We will also try to predict the revenue loss, which is a good indicator of the intensity of a cyclone, and use this information to deploy rescue efforts as soon as a new cyclone is predicted
Data Understanding
• The data contains 60,114 rows, each an instance of a cyclone, and 21 columns/attributes for each cyclone
• The variables in the dataset are:

Variable nos. | Type/Description | Variable names
1-7   | Day, date, and time of the tornado | om, yr (year), mo (month), day, date, time, tz (time zone)
8-10  | State-related information | state, stf (state FIPS no.), stn (state no.)
11-15 | Magnitude and loss in terms of human life and money | mag, inj (injuries), fatalities, loss, closs (crop loss)
16-21 | Attributes measuring the storm/hurricane | slat (starting latitude), slon (starting longitude), elat, elon, len, wid
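The original analysis was done in R and SAS; as an illustrative equivalent, a minimal Python sketch of loading the SPC CSV export and keeping the documented attributes (the file path and exact column names are assumptions based on the table above):

```python
import pandas as pd

# The 21 attributes described above; names are assumed to match the CSV header.
COLS = ["om", "yr", "mo", "day", "date", "time", "tz",      # timing
        "state", "stf", "stn",                              # state info
        "mag", "inj", "fatalities", "loss", "closs",        # magnitude / losses
        "slat", "slon", "elat", "elon", "len", "wid"]       # storm geometry

def load_storms(path):
    """Read the tornado CSV and keep only the documented attributes."""
    df = pd.read_csv(path)
    return df[[c for c in COLS if c in df.columns]]
```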
Data Quality Check & Cleaning
• Correlation Matrix
The predictors that correlate most strongly with the target variable (loss) are:
1. Magnitude
2. Fatalities
3. Length of tornado
4. Width of tornado
• Missing Values
There were no missing values in the dataset.
• Outlier Detection
No significant outliers were found in the dataset.
• Data Split
Out of the 60,114 instances of storms, we randomly split the data:
Training dataset: 20,000 instances
Testing dataset: 40,114 instances
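The correlation check and random split described above can be sketched in Python on toy data (the original work used R/SAS on the full 60,114-row dataset; the column values here are synthetic stand-ins):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy stand-in for the tornado data: 1,000 synthetic rows.
df = pd.DataFrame({
    "mag": rng.integers(0, 6, 1000),
    "fat": rng.integers(0, 5, 1000),
    "len": rng.uniform(0, 50, 1000),
    "wid": rng.uniform(0, 1500, 1000),
})
# Synthetic target driven by magnitude and length.
df["loss"] = 2 * df["mag"] + 0.5 * df["len"] + rng.normal(0, 1, 1000)

# Correlation of each predictor with the target variable `loss`.
corr_with_loss = df.corr()["loss"].drop("loss").sort_values(ascending=False)
print(corr_with_loss)

# Random split; the project used 20,000 training and 40,114 testing rows.
train = df.sample(n=600, random_state=42)
test = df.drop(train.index)
```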
State-wise Loss Prediction
• This analysis looks at total property loss and tornado frequency by state, with the data sliced to the years 1996 through 2015.
• The data is then indexed and aggregated by state, giving each state's tornado frequency and total property damage.
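The slice-and-aggregate step can be sketched as follows (a few hypothetical rows stand in for the real dataset, where each row is one tornado with a state and a loss value):

```python
import pandas as pd

# Hypothetical rows; in the real data each row is one tornado.
df = pd.DataFrame({
    "yr":    [1995, 1996, 2000, 2000, 2015],
    "state": ["TX", "TX", "OK", "TX", "KS"],
    "loss":  [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Slice to 1996-2015, then aggregate frequency and total loss per state.
recent = df[df["yr"].between(1996, 2015)]
by_state = recent.groupby("state")["loss"].agg(frequency="count", total_loss="sum")
print(by_state.sort_values("total_loss", ascending=False))
```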
Relief Measures Allocation
• Texas (TX), with 2,767 tornado occurrences, should be allocated the maximum relief measures.
Multiple Linear Regression
• Multiple linear regression models the relationship between two or more explanatory variables and a response variable by fitting a linear equation to the observed data. Every value of an independent variable x is associated with a value of the dependent variable y.
• We fit a multiple regression model on the training data and applied it to the testing data.
• As the analysis shows, a total of 16 variables are significant when we take loss as the dependent variable and all the remaining variables as independent variables.
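The fit-on-train, apply-on-test workflow can be sketched in Python on synthetic data (the original model was fit in R/SAS; the coefficients below are arbitrary for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
# Toy predictors standing in for the tornado attributes (mag, len, wid, ...).
X = rng.normal(size=(500, 4))
# Synthetic `loss`: only the first two predictors matter here.
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = LinearRegression().fit(X[:300], y[:300])  # fit on "training" rows
r2 = model.score(X[300:], y[300:])                # apply to "testing" rows
print(round(r2, 3))
```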
Step-wise Multiple Regression
• Stepwise regression helps us identify the best subset of variables for multiple regression.
• We will use the result of the stepwise regression in further analysis: instead of using all the independent variables, we will use only the significant variables this procedure selects.
• Again, we applied the model built on the training data to the testing data.
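The stepwise idea can be illustrated with scikit-learn's greedy forward selection, an analogue (not the R stepwise-AIC procedure the slides likely used) that adds one variable at a time:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
# Only the first two columns actually drive the target.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Forward selection: greedily add the variable that improves the fit most.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward")
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected variables
```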
Principal Component Analysis
• We next compute the principal components using PCA.
• The resulting principal components are shown in the screenshot.
Proportion of Variance Explained
• The first 8 components explain 75% of the variance.
• We will now rerun our algorithms using the first 8 principal components and check whether the principal components improve the efficiency of our models.
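The "how many components explain 75% of the variance" check can be sketched as below; the toy data has 3 underlying factors duplicated into 6 correlated columns, so 3 components suffice here (it was 8 on the real dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Correlated toy features standing in for the tornado attributes.
base = rng.normal(size=(500, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(500, 3))])  # 6 columns

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest number of components whose cumulative variance reaches 75%.
k = int(np.searchsorted(cumvar, 0.75) + 1)
print(k, np.round(cumvar[:k], 3))
```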
Random Forest
• Random forests (random decision forests) are an ensemble learning method for classification, regression, and other tasks. They operate by constructing a multitude of decision trees at training time and outputting the class that is the mode of the individual trees' classes (classification) or their mean prediction (regression).
• We have confusion matrices for random forest prediction of loss on the testing data, both with and without PCA.
• In our case, accuracy drops when we use the principal components:
Accuracy without PCA: 86.98%
Accuracy with PCA: 85.38%
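A minimal sketch of the random forest loss classification, with a synthetic binary loss label in place of the real loss classes (the 86.98% figure comes from the project's R analysis, not this toy example):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 4))           # toy mag/len/wid-style predictors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy low/high loss label

X_tr, X_te, y_tr, y_te = X[:600], X[600:], y[:600], y[600:]
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
print(confusion_matrix(y_te, pred))
print(round(accuracy_score(y_te, pred), 3))
```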
Linear Discriminant Analysis
• Discriminant analysis is used to classify individuals into one of two or more groups on the basis of measurements.
• We will try to classify the loss of future cyclones as Low, Medium, or High (1, 2, 3) using the past data.
• We have the LDA confusion matrices both without and with principal components.
• As we can see, the model's accuracy is better without the principal components.
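The three-class LDA setup can be sketched on synthetic data, with three toy loss classes (1 = Low, 2 = Medium, 3 = High) drawn from shifted Gaussians:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(5)
# Three toy loss classes with shifted means.
means = {1: 0.0, 2: 2.0, 3: 4.0}
X = np.vstack([rng.normal(loc=means[c], size=(200, 2)) for c in (1, 2, 3)])
y = np.repeat([1, 2, 3], 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(confusion_matrix(y, lda.predict(X)))
```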
K-Means to Predict Emergency Level
• The k-means clustering algorithm partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
• K-means clustering is applied to the storm dataset to define the different levels (clusters) of emergency under which a particular storm falls.
• The length (in miles) and width (in yards) of the storm are used to build the clusters.
• The 60,114 observations are partitioned into 6 clusters, defining 6 levels of emergency, with level 1 the lowest emergency and level 6 the highest.
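A minimal sketch of the clustering step, assuming length/width are standardized first (since width in yards would otherwise dominate length in miles; the slides don't state whether the original analysis scaled the features):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
# Toy (len, wid) pairs; the real data has 60,114 rows.
X = np.abs(rng.normal(size=(600, 2))) * [10.0, 300.0]  # miles, yards

# Standardize so width (yards) doesn't dominate length (miles).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Six clusters = six emergency levels.
km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_scaled)
print(np.bincount(km.labels_))  # cluster sizes
```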
Random Forest to Predict Frequency of Storms in Different Seasons
• The random forest algorithm is used to predict the frequency of storms in different seasons, so as to analyze the effect of climatic conditions on storms.
• A season attribute was created from the month of occurrence of the tornado:

Months | Season
1-2 (January-February) | Winter
3-6 (March-June) | Spring
7-9 (July-September) | Summer
10-12 (October-December) | Fall
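The month-to-season mapping above can be written directly as:

```python
def season(month):
    """Map month number (1-12) to the season buckets used in this analysis."""
    if month in (1, 2):
        return "Winter"
    if 3 <= month <= 6:
        return "Spring"
    if 7 <= month <= 9:
        return "Summer"
    return "Fall"  # months 10-12

print([season(m) for m in (1, 4, 8, 11)])  # ['Winter', 'Spring', 'Summer', 'Fall']
```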
• Confusion Matrix:
• Calculating % accuracy:
Accuracy = (Fall + Spring + Summer + Winter) / (number of observations)
= (1357 + 36739 + 2228 + 460) / 60114
= 67.84%
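The arithmetic checks out: the diagonal of the confusion matrix (correct predictions per season) over the total row count gives the reported accuracy.

```python
# Diagonal of the confusion matrix: correct predictions per season.
correct = 1357 + 36739 + 2228 + 460
accuracy = correct / 60114
print(f"{accuracy:.2%}")  # 67.84%
```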
• Accuracy = 67.84%
• Our model not only achieved this accuracy but also captured the difference in the occurrence of storms across seasons in the U.S.
• The analysis found that storms are most common in spring and least common in winter.
• The model can be used by government entities, such as disaster management and rescue operations teams, to take the required precautions in different seasons and reduce losses.
Conclusion
• We performed several different analyses: state-wise loss analysis, loss prediction through classification models, prediction of cyclone seasons, and clustering.
• We conclude that cyclone losses can be successfully predicted beforehand, so rescue efforts can be directed accordingly to increase their effectiveness.
• For our data, prediction results were better without PCA; hence we recommend developing models without dimension reduction on this dataset.
• We found random forest to be the most accurate model for predicting loss, at 86.98% accuracy, and will go ahead with this model for prediction.
• We were able to predict the level of emergency using clustering.
• We were also able to predict the seasons when storms are most likely to occur, and accordingly keep tabs on the readiness of rescue efforts.