A doctoral dissertation final defense investigating which weather-pattern components can improve Atlantic tropical cyclone (TC) forecast accuracy, by applying the C4.5 algorithm to all five-day tropical weather discussions from 2001-2015.
The document proposes streaming algorithms for performing Pearson's chi-square goodness-of-fit test in a streaming setting with minimal assumptions. It presents algorithms for the one-sample and two-sample continuous chi-square tests that use O(K^2log(N)√N) space, where K is the number of bins and N is the stream length. It also shows that no sublinear solution exists for the categorical chi-square test and provides a heuristic algorithm. The algorithms are validated on real and synthetic data and can detect deviations from distributions or differences between streams with low memory requirements.
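To make the one-sample test concrete, here is a minimal Python sketch that maintains per-bin counts over a stream and computes Pearson's chi-square statistic against a hypothesized distribution. This toy version stores exact counts (O(K) space) and is not the paper's sublinear-space algorithm; the bin edges and reference distribution are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def stream_chi_square(stream, bin_edges, expected_probs):
    """One-sample chi-square goodness-of-fit over a stream, using exact per-bin counts."""
    counts = np.zeros(len(expected_probs))
    n = 0
    for x in stream:
        k = np.searchsorted(bin_edges, x, side="right") - 1   # index of the bin containing x
        counts[min(max(k, 0), len(counts) - 1)] += 1
        n += 1
    expected = n * np.asarray(expected_probs)
    stat = np.sum((counts - expected) ** 2 / expected)
    p_value = chi2.sf(stat, df=len(counts) - 1)
    return stat, p_value

# Example: test whether a stream looks uniform on [0, 1) with K = 10 bins.
rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 11)
stat, p = stream_chi_square(rng.uniform(size=100_000), edges, [0.1] * 10)
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```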
This document discusses developing a statistical model to predict future trainer costs using historical cost data. It analyzes cost data from 245 training systems, partitioning the data in various ways to find meaningful similarities. The most useful partitioning divided systems into new vs upgrade systems, then by device type and platform. Initial statistical tests found too much variation within other partitions to support prediction. The goal is to develop an accurate, efficient predictive tool to aid cost estimation and decision making.
An SPRT Procedure for an Ungrouped Data using MMLE Approach – IOSR Journals
This document describes a sequential probability ratio test (SPRT) procedure for analyzing ungrouped software failure data using a modified maximum likelihood estimation (MMLE) approach. The SPRT procedure can help quickly detect unreliable software by making decisions with fewer observed failures than traditional hypothesis testing methods. Parameters are estimated using MMLE, which approximates functions in the maximum likelihood equation with linear functions to simplify calculations compared to other estimation methods. The document provides details on how to apply the SPRT procedure and MMLE parameter estimation to a software reliability growth model to analyze software failure data sequentially and detect unreliable software components earlier.
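As a rough illustration of the sequential decision logic described above (not the paper's MMLE-based reliability growth model), the sketch below runs Wald's SPRT on a stream of Bernoulli failure indicators, stopping as soon as the log-likelihood ratio crosses the boundaries derived from the chosen error rates; all parameter values are placeholders.

```python
import math
import random

def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.05):
    """Wald's SPRT for H0: p = p0 vs H1: p = p1 on a stream of 0/1 failure indicators."""
    upper = math.log((1 - beta) / alpha)   # crossing above -> accept H1 (unreliable)
    lower = math.log(beta / (1 - alpha))   # crossing below -> accept H0 (reliable)
    llr = 0.0
    for n, x in enumerate(observations, start=1):
        llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0 (unreliable)", n
        if llr <= lower:
            return "accept H0 (reliable)", n
    return "continue sampling", len(observations)

# Example: decide between failure probabilities p0 = 0.01 and p1 = 0.05 per demand.
random.seed(1)
data = [1 if random.random() < 0.05 else 0 for _ in range(2000)]
print(sprt_bernoulli(data, p0=0.01, p1=0.05))
```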
#3/9 Ornithology monitoring on offshore windfarms – NaturalEngland
Presentation #3 of 9: Mark Trinder of MacArthur Green highlighting issues to do with ornithological monitoring at offshore windfarms, survey design and inference
AUTOMATIC GENERATION AND OPTIMIZATION OF TEST DATA USING HARMONY SEARCH ALGOR... – csandit
Software testing is a primary phase of software development, carried out by executing a sequence of test inputs and comparing the results against expected outputs. The Harmony Search (HS) algorithm is based on the improvisation process of music. Compared with other algorithms, HS has gained popularity in the field of evolutionary computation. When musicians compose a harmony from different possible combinations of pitches, the pitches are stored in harmony memory, and optimization proceeds by adjusting the input pitches to generate the perfect harmony. The test case generation process identifies test cases within available resources and also identifies critical domain requirements. In this paper, the role of the Harmony Search meta-heuristic is analyzed for generating random test data and optimizing that test data. Test data are generated and optimized for a case study, a withdrawal task at a bank ATM, using Harmony Search. It is observed that the algorithm generates suitable test cases as well as test data; the paper also gives brief details of the Harmony Search method and its use for test data generation and optimization.
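The abstract contains no code, but a minimal Harmony Search loop for numeric test-data generation might look like the sketch below. The fitness function (distance from a hypothetical ATM withdrawal boundary of 500) and all HS parameters are illustrative assumptions, not taken from the paper.

```python
import random

def harmony_search(fitness, low, high, hms=10, hmcr=0.9, par=0.3, bw=10, iters=500):
    """Minimal Harmony Search over a single numeric variable (minimizes fitness)."""
    memory = [random.uniform(low, high) for _ in range(hms)]   # harmony memory
    for _ in range(iters):
        if random.random() < hmcr:                 # pick a pitch from memory...
            x = random.choice(memory)
            if random.random() < par:              # ...and maybe adjust it within the bandwidth
                x += random.uniform(-bw, bw)
        else:                                      # or improvise an entirely new pitch
            x = random.uniform(low, high)
        x = min(max(x, low), high)
        worst = max(range(hms), key=lambda i: fitness(memory[i]))
        if fitness(x) < fitness(memory[worst]):    # replace the worst harmony if improved
            memory[worst] = x
    return min(memory, key=fitness)

# Toy objective: generate a withdrawal amount close to an assumed boundary value of 500.
boundary = 500
best = harmony_search(lambda amt: abs(amt - boundary), low=0, high=10_000)
print(f"generated test input: {best:.1f}")
```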
This document discusses approaches to optimizing discrete-event simulation models over the past 20 years. While computational power has increased, recent literature shows a lack of new approaches and a widening divide between simulation modeling, optimization, and implementing improvements. The document proposes two areas for advancing the field: 1) integrating simulation optimization dynamically into operations rather than as a static tool, and 2) developing intelligent interfaces that can recognize input parameters and select appropriate optimization algorithms for specific problems.
TECHNICAL REVIEW: PERFORMANCE OF EXISTING IMPUTATION METHODS FOR MISSING DATA... – IJDKP
Incomplete data are present in many study contexts. Such uncollected information is known as missing data (values) and is a critical problem for many researchers. The problem is especially acute in air pollution monitoring, where data are collected from multiple monitoring stations spread across various locations. Various imputation methods for missing data have been proposed in the literature; in this research we consider only existing imputation methods and record their performance in ensemble creation. The five existing imputation methods deployed are the series mean method, mean of nearby points, median of nearby points, linear trend at a point, and linear interpolation. The series mean (SM) method performed better than the other imputation methods, with the lowest mean absolute error and better accuracy for SVM ensemble creation on the CO data set using bagging and boosting algorithms.
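For readers unfamiliar with the imputation methods named above, the short pandas sketch below applies two of them (series mean and linear interpolation) to a toy CO concentration series with gaps; the data values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy CO readings with two missing values (illustrative data only).
co = pd.Series([0.8, 0.9, np.nan, 1.1, 1.0, np.nan, 0.7])

series_mean = co.fillna(co.mean())                 # series mean (SM) method
linear_interp = co.interpolate(method="linear")    # linear interpolation

print(pd.DataFrame({"raw": co, "SM": series_mean, "linear": linear_interp}))
```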
D1 design and analysis approaches to evaluate cardiovascular risk - 2012 eugm – therealreverendbayes
This document summarizes a presentation on approaches to evaluate cardiovascular risk in diabetes drug development. It discusses using meta-analysis and group sequential designs to integrate cardiovascular evaluation into clinical trials and potentially reduce patient exposure. It also compares options like conducting a single large outcome study, two separate cardiovascular outcome trials, or incorporating sub-studies into cardiovascular outcome trials. The presentation emphasizes planning for both non-inferiority and superiority assessments and considering operational aspects like maintaining trial blinding for interim analyses.
This document discusses sampling methods for market research. It defines key terms like population, sample, census. It explains that a sample is a subgroup of the population used to make inferences, while a census involves surveying the entire population. The document compares sample vs census and factors to consider like budget, time, population size. It outlines the sampling design process of defining the target population, determining the sampling frame, selecting a technique, determining sample size, and executing sampling. Finally, it classifies sampling techniques as probability or non-probability.
Natural convection in a differentially heated cavity plays a major role in understanding the flow physics and heat transfer aspects of various applications. Parameters such as the Rayleigh number, Prandtl number, aspect ratio, inclination angle and surface emissivity are considered to have either an individual or a grouped effect on natural convection in an enclosed cavity. In spite of this, simultaneous study of these parameters over a wide range is rare. Developing a correlation that captures the effect of a large number of parameters over a wide range is challenging, and the number of simulations required to generate correlations for even a small number of parameters is extremely large. To date there is no streamlined procedure to optimize the number of simulations required for correlation development. Therefore, the present study aims to optimize the number of simulations using the Taguchi technique and then generate correlations by multiple-variable regression analysis. It is observed that, for a wide range of parameters, the proposed CFD-Taguchi-Regression approach drastically reduces the total number of simulations required for correlation generation.
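The abstract describes the CFD-Taguchi-Regression workflow only in words; as a hedged sketch of the final regression step, the code below fits a power-law correlation Nu = a·Ra^b·Pr^c to a handful of made-up (Ra, Pr, Nu) samples by ordinary least squares in log space. The functional form and the sample values are assumptions for illustration, not results from the study.

```python
import numpy as np

# Illustrative (Rayleigh, Prandtl, Nusselt) samples, e.g. from a Taguchi-planned CFD run.
Ra = np.array([1e4, 1e5, 1e6, 1e7, 1e5, 1e6])
Pr = np.array([0.71, 0.71, 0.71, 0.71, 7.0, 7.0])
Nu = np.array([2.3, 4.6, 9.0, 17.5, 5.4, 10.8])

# Fit log(Nu) = log(a) + b*log(Ra) + c*log(Pr) by least squares.
X = np.column_stack([np.ones_like(Ra), np.log(Ra), np.log(Pr)])
coef, *_ = np.linalg.lstsq(X, np.log(Nu), rcond=None)
a, b, c = np.exp(coef[0]), coef[1], coef[2]
print(f"Nu = {a:.3f} * Ra^{b:.3f} * Pr^{c:.3f}")
```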
Applicability of Hooke’s and Jeeves Direct Search Solution Method to Metal c... – ijiert bestjournal
The role of optimization in engineering design has become prominent with the advent of computers, and optimization is now part of computer-aided design activities. It is primarily used in those design activities where the goal is not only to achieve a feasible design but also to meet a design objective. In most engineering design activities, the design objective could simply be to minimize the cost of production or to maximize the efficiency of production. An optimization algorithm is a procedure that is executed iteratively, comparing various solutions until an optimum or satisfactory solution is found. In many industrial design activities, optimization is achieved indirectly by comparing a few chosen design solutions and accepting the best one. This simplistic approach never guarantees the true optimum solution, whereas optimization algorithms begin with one or more design solutions supplied by the user and then iteratively generate and check new designs in search of the true optimum. Two distinct types of optimization algorithms are in use today: first, algorithms that are deterministic, with specific rules for moving from one solution to the next; second, algorithms that use stochastic transition rules.
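To make the deterministic family of algorithms mentioned above concrete, here is a compact Hooke and Jeeves pattern-search sketch for an unconstrained minimization problem; the objective function and step-size schedule are illustrative, not the paper's metal-cutting formulation.

```python
import numpy as np

def hooke_jeeves(f, x0, step=0.5, shrink=0.5, tol=1e-6, max_iter=1000):
    """Hooke & Jeeves direct (pattern) search for unconstrained minimization."""
    def explore(base, s):
        # Try perturbing each coordinate by +/- s and keep any improvement.
        best = base.copy()
        for i in range(len(base)):
            for delta in (s, -s):
                trial = best.copy()
                trial[i] += delta
                if f(trial) < f(best):
                    best = trial
                    break
        return best

    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        if step < tol:
            break
        new = explore(x, step)
        if f(new) < f(x):
            # Pattern move: jump along the promising direction, then explore again.
            pattern = explore(new + (new - x), step)
            x = pattern if f(pattern) < f(new) else new
        else:
            step *= shrink          # no improvement: shrink the exploratory step
    return x

# Example: minimize a simple quadratic with its optimum at (3, -2).
fun = lambda v: (v[0] - 3) ** 2 + (v[1] + 2) ** 2
print(hooke_jeeves(fun, [0.0, 0.0]))
```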
This document summarizes a study examining the relationship between reaction times (RTs) and general cognitive ability (g) using a number comparison task. The study administered the task to two groups of participants with different average g levels. Results confirmed that the higher-g group had faster RTs compared to the moderate-g group. Both groups responded more slowly when numbers were closer together. The diffusion model provided a good fit to the data and supported previous findings of a negative correlation between RTs and g on simple tasks.
Decision Support Systems in Clinical Engineering – Asmaa Kamel
This document provides an overview of the Analytic Hierarchy Process (AHP) decision support system and presents a case study on using AHP to make medical equipment scrapping decisions. The key points are:
1) AHP breaks down a complex decision problem into a hierarchy, then uses pairwise comparisons to determine criteria weights and rank alternatives. It was used in this case study to evaluate 9 dialysis machines for potential scrapping.
2) Criteria for the dialysis machine scrapping decision included age, performance, safety record, and costs. Data was incomplete so the study simulated different scenarios to examine the impact.
3) AHP derived local and global priorities to determine each machine's overall priority for scrapping.
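A minimal numeric illustration of the AHP step described in point 1 follows: it derives criteria weights from a pairwise comparison matrix via the principal eigenvector and reports a consistency index. The 4x4 comparison values (for age, performance, safety record, and cost) are invented for illustration, not taken from the case study.

```python
import numpy as np

# Illustrative pairwise comparison matrix for (age, performance, safety, cost).
A = np.array([
    [1,   1/3, 1/5, 1/2],
    [3,   1,   1/2, 2  ],
    [5,   2,   1,   3  ],
    [2,   1/2, 1/3, 1  ],
], dtype=float)

eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)                      # principal eigenvalue
weights = np.abs(eigvecs[:, k].real)
weights /= weights.sum()                         # normalize weights to sum to 1

ci = (eigvals.real[k] - len(A)) / (len(A) - 1)   # consistency index
print("criteria weights:", np.round(weights, 3))
print("consistency index:", round(ci, 3))
```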
Lung cancer disease analysis using PSO-based fuzzy logic system – eSAT Journals
Abstract
The main objective of this paper is to improve the accuracy of lung cancer disease investigation using particle swarm optimization (PSO) in combination with a fuzzy expert system. The paper briefly introduces fuzzy expert systems, and the proposed scheme is compared with related methods. Experimental results of the proposed system were simulated in MATLAB 2014.
Application of the analytic hierarchy process (AHP) for selection of forecast... – Gurdal Ertek
In this paper, we described an application of the Analytic Hierarchy Process (AHP) for the ranking and selection of forecasting software. AHP is a multi-criteria decision making (MCDM) approach based on the pair-wise comparison of elements of a given set with respect to multiple criteria. Even though there are applications of the AHP to software selection problems, we have not encountered a study that involves forecasting software. We started our analysis by filtering among forecasting software found on the Internet by undergraduate students as part of a course project. We then performed a second filtering step, reducing the number of software packages to be examined even further. Finally, we constructed the comparison matrices based upon the evaluations of three “semi-experts” and obtained a ranking of the selected forecasting software using the Expert Choice software. We report our findings and insights, together with the results of a sensitivity analysis.
http://research.sabanciuniv.edu.
Approximation models (or surrogate models) provide an efficient substitute for expensive physical simulations and an efficient solution to the lack of physical models of system behavior. However, it is challenging to quantify the accuracy and reliability of such approximation models in a region of interest or the overall domain without additional system evaluations. Standard error measures, such as the mean squared error, the cross-validation error, and Akaike's information criterion, provide limited (often inadequate) information regarding the accuracy of the final surrogate. This paper introduces a novel and model-independent concept to quantify the level of error in the function value estimated by the final surrogate in any given region of the design domain. This method is called the Regional Error Estimation of Surrogate (REES). Assuming the full set of available sample points to be fixed, intermediate surrogates are iteratively constructed over a sample set comprising all samples outside the region of interest and heuristic subsets of samples inside the region of interest (i.e., intermediate training points). The intermediate surrogate is tested over the remaining sample points inside the region of interest (i.e., intermediate test points). The fraction of sample points inside the region of interest used as intermediate training points is fixed at each iteration, with the total number of iterations being pre-specified. The estimated median and maximum relative errors within the region of interest for the heuristic subsets at each iteration are used to fit distributions of the median and maximum error, respectively. The estimated statistical mode of the median and the maximum error, and the absolute maximum error, are then represented as functions of the density of intermediate training points, using regression models. The regression models are then used to predict the expected median and maximum regional errors when all the sample points are used as training points. Standard test functions and a wind farm power generation problem are used to illustrate the effectiveness and utility of this regional error quantification method.
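The REES procedure above is involved; the deliberately simplified sketch below captures only its core idea of relating held-out error inside a region of interest to the fraction of in-region samples used for training, here with a plain quadratic least-squares surrogate on a 1-D test function. The surrogate type, region, error measure, and extrapolation model are all assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x) + 0.3 * x          # stand-in for an expensive simulation
X = rng.uniform(0, 3, 40)                      # full, fixed sample set
region = (X > 1.0) & (X < 2.0)                 # region of interest

def fit_quadratic(x, y):
    return np.polyfit(x, y, deg=2)             # simple polynomial surrogate

fractions, median_errors = [], []
for frac in (0.2, 0.4, 0.6, 0.8):
    errs = []
    for _ in range(30):                        # heuristic subsets of in-region points
        inside = np.flatnonzero(region)
        train_in = rng.choice(inside, size=max(1, int(frac * inside.size)), replace=False)
        test_in = np.setdiff1d(inside, train_in)
        train = np.concatenate([np.flatnonzero(~region), train_in])
        coef = fit_quadratic(X[train], f(X[train]))
        pred = np.polyval(coef, X[test_in])
        errs.append(np.median(np.abs(pred - f(X[test_in]))))   # median held-out error
    fractions.append(frac)
    median_errors.append(np.median(errs))

# Regress the median regional error against the training fraction and extrapolate to 1.0,
# i.e. predict the error expected when all sample points are used for training.
slope, intercept = np.polyfit(fractions, median_errors, deg=1)
print("expected median regional error with all samples:", round(slope * 1.0 + intercept, 4))
```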
Market analysis of transmission expansion planning by expected cost criterion – Editor IJMTER
In this paper, a new market-based approach for transmission expansion planning in deregulated power systems is presented. Restructuring and deregulation have exposed transmission planners to new objectives and uncertainties; therefore, new criteria and approaches are needed for transmission planning in deregulated environments. We introduce a new method for computing locational marginal prices and new market-based criteria for transmission expansion planning in deregulated environments. The presented approach is applied to the Southern Region (SR) 48-bus Indian system using a scenario technique with the expected cost criterion.
FellowBuddy.com is an innovative platform that brings students together to share notes, exam papers, study guides, project reports and presentation for upcoming exams.
We connect Students who have an understanding of course material with Students who need help.
Benefits:-
# Students can catch up on notes they missed because of an absence.
# Underachievers can find peer developed notes that break down lecture and study material in a way that they can understand
# Students can earn better grades, save time and study effectively
Our Vision & Mission – Simplifying Students Life
Our Belief – “The great breakthrough in your life comes when you realize it, that you can learn anything you need to learn; to accomplish any goal that you have set for yourself. This means there are no limits on what you can be, have or do.”
Like Us - https://www.facebook.com/FellowBuddycom
Parametric estimation of construction cost using combined bootstrap and regre... – IAEME Publication
The document discusses a method for estimating construction costs using a combined bootstrap and regression technique. It involves using historical project data to develop a regression model relating cost to key parameters. A bootstrap resampling method is then used to generate multiple simulated datasets from the original. Regression analysis is performed on each resampled dataset to calculate coefficients and develop a cost range estimate that captures uncertainty. This allows integrating probabilistic and parametric estimation methods while requiring fewer assumptions than traditional statistical techniques. The goal is to provide more accurate conceptual cost estimates early in projects when design information is limited.
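A compact sketch of the combined bootstrap-and-regression idea described above: resample the historical projects with replacement, refit a linear cost model on each resample, and read a cost range off the distribution of predictions. The toy dataset (floor area and storeys versus cost) and the 2.5/97.5 percentile band are illustrative choices, not the paper's data or model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative historical projects: [floor area (m^2), storeys] -> cost (in millions).
X = np.array([[1200, 2], [2500, 4], [800, 1], [3100, 5], [1800, 3],
              [2200, 3], [950, 2], [2700, 4], [1500, 2], [3400, 6]], dtype=float)
y = np.array([3.1, 6.8, 2.0, 8.5, 4.9, 5.9, 2.6, 7.4, 4.0, 9.3])

new_project = np.array([1.0, 2000.0, 3.0])      # intercept term, area, storeys
A = np.column_stack([np.ones(len(X)), X])

preds = []
for _ in range(2000):                           # bootstrap resamples of the project set
    idx = rng.integers(0, len(X), size=len(X))
    coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
    preds.append(new_project @ coef)

low, mid, high = np.percentile(preds, [2.5, 50, 97.5])
print(f"estimated cost: {mid:.2f}M (95% range {low:.2f}M to {high:.2f}M)")
```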
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics – Elinor Velasquez
This document proposes a novel methodology for predictive analytics based on topological-geometric-analytic-algebraic principles. It views the universe as a canonical heat bath partitioned into components that act as restricted thermal reservoirs. Each component has a well-defined structure and invariant that allows for new predictions. The methodology generalizes concepts like entropy and reinterprets prediction in terms of biological form and function. This provides a new framework for predictive modeling, especially with big data.
Chapter 12 – The Chi-Square Test: Analyzing Categorical Data
Learning Objectives
After reading this chapter, you should be able to:
• Describe the conditions that fit chi-square tests.
• Calculate and interpret the goodness of fit test and chi-square test of independence.
• Calculate and interpret the phi coefficient and Cramer’s V.
Chapter Outline
12.1 Examining Categorical Data
12.2 The Goodness-of-Fit (1 × k) Chi-Square
Calculating the Test Statistic
Interpreting the Test Statistic
Understanding the Chi-Square Hypotheses
Distinguishing Between Goodness-of-Fit Chi-Square Tests and t-Tests or ANOVAs
A 1 × k (Goodness-of-Fit) Chi-Square Problem With Unequal fe Values
A Final 1 × k Problem
12.3 The Chi-Square and Statistical Power
12.4 The Goodness-of-Fit Test in Excel
12.5 The Chi-Square Test of Independence
Setting up the Chi-Square Test of Independence
Interpreting the Chi-Square Test of Independence
Phi Coefficient and Cramer’s V
A 3 × 3 Test of Independence Problem
Chapter Summary
12.1 Examining Categorical Data
The 19th-century British statesman Benjamin Disraeli is credited with saying that there are three kinds of lies: lies, damned lies, and statistics. Clearly, he had to have a place in this book, even if it is in the final chapter. But he belongs here because of another comment that is particularly relevant to the topics in this chapter. He observed that what we anticipate seldom occurs and what we least expect generally happens (Oxford, 1980). Disraeli’s expressed skepticism was almost certainly tongue in cheek. Indeed, the work on regression in Chapters 9 and 10 is based on the understanding that outcomes are not unpredictable, but the statement provides an effective segue into the connection between what occurs and what might be expected to occur. That analysis is the focus of this chapter.
Part of the discussion in Chapter 2 was how data differ according to scale, and how the statistics that can be calculated also relate to scale; you learned about different types of data scales and the appropriate types of statistics for each. For example, for nominal scale data, only the mode (Mo) makes sense as a measure of central tendency. Subsequent chapters revealed that it is not only descriptive statistics that are specific to the scale of the data. The more involved statistical tests are also data-scale dependent. Recall that the dependent variable in a t-test, a z-test, and ANOVA must be data that fit a continuous (interval or ratio) scale. Both variables in the Pearson Correlation must be at least interval scale. These distinctions are very important. Along with whether the hypothesis deals with difference or association and whether the groups are independent, the scale of the data is an important guide to determining the appropriate statistical procedure.
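Since this chapter works its examples by hand and in Excel, a short equivalent in Python may help; the sketch below runs a 1 × k goodness-of-fit test and a 2 × 2 test of independence with Cramér's V, using invented frequency counts.

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# 1 x k goodness-of-fit: observed preferences across 4 categories vs. equal expected frequencies.
observed = np.array([18, 25, 22, 35])
stat, p = chisquare(observed)                    # equal expected frequencies by default
print(f"goodness-of-fit: chi2 = {stat:.2f}, p = {p:.3f}")

# 2 x 2 test of independence (rows: group, columns: yes/no responses).
table = np.array([[30, 20],
                  [15, 35]])
stat, p, dof, expected = chi2_contingency(table, correction=False)
n = table.sum()
cramers_v = np.sqrt(stat / (n * (min(table.shape) - 1)))  # equals phi for a 2 x 2 table
print(f"independence: chi2 = {stat:.2f}, p = {p:.3f}, Cramer's V = {cramers_v:.2f}")
```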
Cyrus Mehta outlines four new initiatives for enhancing the simulation capabilities of East: 1) permitting external calls to R and SAS, 2) conditional simulation of trial remainder given interim data, 3) multi-arm group sequential designs, and 4) population enrichment designs. He discusses challenges in software for event-driven trials and how population enrichment can improve late-stage oncology trial success rates. The presentation provides examples of conditional simulation plots and a proposed two-stage adaptive design for population enrichment. Mehta is optimistic about Cytel's future in advancing adaptive trial methodology software over the next 25 years.
Eugm 2012 mehta - future plans for east - 2012 eugm – Cytel USA
Cyrus Mehta outlines four new initiatives for enhancing the simulation capabilities of East: 1) permitting external calls to R and SAS, 2) conditional simulation of trial remainder given interim data, 3) multi-arm group sequential designs, and 4) population enrichment designs. He discusses challenges in software for event-driven trials and how population enrichment can improve late-stage oncology trial success rates. The presentation provides examples of adaptive designs and concludes by thanking participants for ideas to further develop Cytel's software.
Six Sigma is a data-driven methodology for improving processes by eliminating defects. It aims for nearly flawless processes, with 99.99966% of all opportunities operating without defects. Six Sigma follows the DMAIC model, consisting of five phases - Define, Measure, Analyze, Improve, and Control. The goal is to reduce variation and maintain consistent, high-quality output through a problem-solving approach focused on addressing root causes of defects.
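The 99.99966% figure quoted above corresponds to the conventional Six Sigma target of 3.4 defects per million opportunities; the short check below just does that arithmetic.

```python
yield_rate = 0.9999966                 # fraction of opportunities completed without defects
dpmo = (1 - yield_rate) * 1_000_000    # defects per million opportunities
print(f"DPMO = {dpmo:.1f}")            # prints 3.4, the classic Six Sigma target
```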
This document discusses various statistical analysis techniques used in marketing research. It begins by explaining how to bring raw data into order through arrays, tabulations and establishing categories. It then discusses descriptive, inferential, differences, associative and predictive analysis. The document also covers univariate techniques like t-tests, z-tests, ANOVA, chi-square tests and multivariate techniques like regression, conjoint analysis and cluster analysis. It provides guidance on when to use specific statistical tests and covers statistics used in cross-tabulation like phi coefficient, contingency coefficient and Cramer's V.
This document provides an overview of high performance liquid chromatography (HPLC) fundamentals and theory. It contains slides created by Agilent Technologies for teaching purposes only regarding HPLC instrumentation, parameters that influence separations such as efficiency, selectivity, retention, and the Van Deemter equation. The document explains key HPLC concepts and how changing variables like stationary phase, mobile phase, temperature and column parameters can optimize separations.
(Chapman & Hall_CRC texts in statistical science series) Peter Sprent, Nigel ... – B087PutraMaulanaSyah
This book provides an updated introduction to nonparametric and distribution-free statistical methods. The third edition expands coverage of topics such as ethical considerations, power and sample size calculations, and includes new material on angular data analysis and capture-recapture methods. Examples have been chosen from a wider range of disciplines. While retaining the basic format of previous editions, changes have been made to emphasize developments in computing and new attitudes towards data analysis.
The document summarizes a machine learning project to predict Parkinson's disease. It discusses cleaning and exploring the data, which includes speech attribute data from 240 subjects. Feature importance analysis found attributes like Delta3 and MFCCs to be important. Various machine learning models were tested, with random forest performing best at 97.2% accuracy after cross-validation. The conclusion discusses further optimizing models and collecting more data. Lessons learned note challenges of limited labeled data and importance of domain knowledge.
This document provides an overview of a project to build a machine learning model to predict Parkinson's disease. It discusses the process of data cleaning, feature engineering, model building and evaluation using different classification techniques. Random forest was found to perform best with an accuracy of 97.2% at predicting Parkinson's disease status based on speech attributes. Key features identified were Delta3, MFCC3, MFCC9, MFCC8 and HNR05. Further improvements could include additional data and techniques like XGBoost.
Did something change? Using Statistical Techniques to Interpret Service and ... – Frank Bereznay
This paper presents a SAS based coding framework to develop tabular dashboards using Proc Report. A tabular dashboard is a two dimensional matrix of metric values with left hand columns to name and group resources. Tabular data provides for discrete data points and at the same time a dense presentation format. Dashboard capabilities include threshold based traffic lighting of data elements, drill down capabilities and automated notification for exceptions. Macro tools are used to simplify the coding required.
This document provides an overview and objectives of Chapter 1: Introduction to Statistics from an elementary statistics textbook. It covers key statistical concepts like data, population, sample, variables, and the two branches of statistics - descriptive and inferential. Potential pitfalls in statistical analysis like misleading conclusions, biased samples, and nonresponse are also discussed. Examples are provided to illustrate concepts like voluntary response samples, statistical versus practical significance, and interpreting correlation.
The document discusses analysis of high frequency data (HFD) from currency exchange markets. It outlines objectives to improve volatility measurement and modeling of market dynamics using HFD. The data has peculiarities like periodic patterns and outliers that complicate analysis. Methodologies used include filtering returns to remove periodicities and spectral analysis. Results show HFD provides evidence of long memory features in volatility over time. The ability of HFD to confirm volatility theories has improved research.
This document describes a fuzzy decision support system using a multi-criteria analysis approach to select the best environment-watershed plan. It establishes a hierarchical structure of evaluation criteria and uses fuzzy analytic hierarchy process (FAHP) to determine the weights of criteria based on expert judgments. The study then evaluates plan alternatives using fuzzy multiple criteria decision making (FMCDM) to handle qualitative criteria. An empirical case study demonstrates the synthesis decision process by integrating FAHP and FMCDM for selecting the most appropriate watershed plan.
Factors affecting the usage of ChatGPT: Advancing an information technology a... – Mark Anthony Camilleri
Few studies have explored the use of artificial intelligence-enabled (AI-enabled) large language models (LLMs). This research addresses this knowledge gap. It investigates perceptions and intentional behaviors to utilize AI dialogue systems like Chat Generative Pre-Trained Transformer (ChatGPT). A survey questionnaire comprising measures from key information technology adoption models, was used to capture quantitative data from a sample of 654 respondents. A partial least squares (PLS) approach assesses the constructs' reliabilities and validities. It also identifies the relative strength and significance of the causal paths in the proposed research model. The findings from SmartPLS4 report that there are highly significant effects in this empirical investigation particularly between source trustworthiness and performance expectancy from AI chatbots, as well as between perceived interactivity and intentions to use this algorithm, among others. In conclusion, this contribution puts forward a robust information technology acceptance framework that clearly evidences the factors that entice online users to habitually engage with text-generating AI chatbot technologies. It implies that although they may be considered as useful interactive systems for content creators, there is scope to continue improving the quality of their responses (in terms of their accuracy and timeliness) to reduce misinformation, social biases, hallucinations and adversarial prompts.
Eugm 2011 mehta - adaptive designs for phase 3 oncology trials – Cytel USA
This document discusses adaptive designs for phase 3 oncology trials. It uses the VALOR trial as a case study to illustrate a sponsor's dilemma in designing a trial with limited prior data. It proposes a promising zone design that allows staged investment - an initial modest sample size with the option to increase size and power if interim results are promising. Simulations show this two-stage investment approach increases power over a non-adaptive design while managing risks for sponsors. The document also discusses extensions to population enrichment designs.
DISTANT-CTO: A Zero Cost, Distantly Supervised Approach to Improve Low-Resour... – Anjani Dhrangadhariya
PICO recognition is an information extraction task for identifying participant, intervention, comparator, and outcome information from clinical literature.
Manually identifying PICO information is the most time-consuming step for conducting systematic reviews (SR) which is already a labor-intensive process.
A lack of diversified and large, annotated corpora restricts innovation and adoption of automated PICO recognition systems.
The largest-available PICO entity/span corpus is manually annotated which is too expensive for a majority of the scientific community.
To break through the bottleneck, we propose DISTANT-CTO, a novel distantly supervised PICO entity extraction approach using the clinical trials literature, to generate a massive weakly-labeled dataset with more than a million "Intervention" and "Comparator" entity annotations.
We train distant NER (named-entity recognition) models using this weakly-labeled dataset and demonstrate that it outperforms even the sophisticated models trained on the manually annotated dataset, with a 2% F1 improvement on the Intervention entity of the PICO benchmark and more than 5% improvement when combined with the manually annotated dataset.
We investigate the generalizability of our approach and gain an impressive F1 score on another domain-specific PICO benchmark.
The approach is not only zero-cost but is also scalable for a constant stream of PICO entity annotations.
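The DISTANT-CTO pipeline itself is not reproduced here, but the core distant-supervision move (projecting known intervention names from a trial registry entry onto sentence tokens as weak NER labels) can be sketched as below; the sentence, the intervention list, and the BIO tagging scheme are illustrative assumptions.

```python
def weak_label(tokens, intervention_names):
    """Assign BIO labels to tokens by exact matching against registry intervention names."""
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for name in intervention_names:
        name_toks = name.lower().split()
        for i in range(len(lowered) - len(name_toks) + 1):
            if lowered[i:i + len(name_toks)] == name_toks:
                labels[i] = "B-INT"                       # beginning of an intervention span
                for j in range(i + 1, i + len(name_toks)):
                    labels[j] = "I-INT"                   # inside the span
    return labels

sentence = "Patients received oral metformin or placebo twice daily".split()
print(list(zip(sentence, weak_label(sentence, ["metformin", "placebo"]))))
```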
The document discusses patents, the patent process, and provides guidance on inventing. It explains that a patent secures an invention for up to 20 years, but that filing a patent is rare with less than 1% of the US population filing one. It encourages collaborating with others when inventing, researching prior art to ensure an idea is unique, and outlines Boeing's process for pursuing a patent from initial idea through submission and review. The overall message is that inventing through following ideas from "What if..." questions and pursuing patents can have value, though it requires focus and following the process.
This document describes strategies for improving communication in English as a second language. It explains that communication is key to professional advancement, since it requires interviewing, giving technical presentations, and participating in interactive reviews. It also addresses impostor syndrome and ways to overcome it, such as finding average role models. In addition, it offers advice for building effective talks, such as focusing on an introduction, conclusions, and three to four main points supported with stories and facts.
Scenario Planning example: Superstruct and ELCC (4/2019)Skylar Hernandez
This professional seminar was an interactive session (hence the slides are filled in with information) at the ELCC conference in 2019. The seminar applies scenario planning techniques on the future of e-learning using concepts and ideas from Jane McGonigal's Superstruct (2008) game.
The Effect of Latent Heat on the Extratropical Transition of Typhoon Sinlaku ...Skylar Hernandez
Master's thesis work on "What is the sensitivity of Extratropical Transition onset and completion to latent heating from the storm and surrounding area?"
Research Proposal: The effect of varying the reconnaissance flight patterns o...Skylar Hernandez
This research proposal showcases a possible future project that would evaluate the effects of varying multiple reconnaissance flight patterns before the onset of Extratropical Transition of Tropical Cyclone Sinlaku. This could be done by generating pseudo-reconnaissance data sets from the high-resolution ECMWF reanalysis data to represent a multitude of different flight patterns that could have been flown.
The purpose is to define the right set of WRF physics options, using decision rules and a decision matrix, to improve hurricane forecasts.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; and (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
Learn SQL from basic queries to Advanced queriesmanishkhaire30
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data.
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad, and Procure.FYI's Co-Founder
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals the strategies and tools you can use to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Application of predictive analytics on semi-structured north Atlantic tropical cyclone forecasts (2/2017)
1. Application of predictive analytics on semi-structured north Atlantic tropical cyclone forecasts
Dr. Caroline Howard, Ph.D., Research Supervisor and Chair
Dr. Richard Livingood, Ph.D., Committee Member
Dr. Cynthia Calongne, D.CS, Committee Member
By
Michael K. Hernandez
February 2017
Final Presentation
2. Overview of Presentation
• Proposal Recap
• Problem Opportunity Statement
• Tropical Cyclone (TC) Lifecycle
• Three gaps in knowledge
• Research Question & Hypothesis
• Theoretical Framework and Lens
• Methodology
• Instrument, Sampling Procedure & Data Collection
• Findings
• Descriptive Analytics on TC data
• Term Document Frequency over time
• Information Gain
• Decision Trees
• Conclusions
• Implications for Practice
• Limitations
• Future Research
4. Problem Opportunity Statement
General Problem: Tropical Cyclones (TCs) threaten global coastlines annually; 2 TCs threaten to make landfall on US coastlines annually.
Specific Problem: A 50% improvement in forecast accuracy is needed by 2019, yet the focus has been narrowly on forecasting models and in-situ data, not on applying data analytics to text data.
Central Problem: TC forecasting is a wicked problem; there is no "one size fits all" solution.
This study attempts to solve one aspect of the problem, due to the framing of the research question.
6. Three Gaps in the Body of Knowledge
• Gall et al. (2013) described the critical success factors for assessing the improvement made in forecasting Tropical Cyclones (TCs) through the use of dynamical and ensemble forecasting models, but they did not take into account other methods of big data analytics.
• Garcia, Ferraz, and Vivacqua (2009) identified that subject matter experts are not always available to verify the importance and accuracy of data-mined results.
• Corrales, Ledezma, and Corrales (2015) noted that there is a need to add another instance of predictive text analytics to other fields, thus deepening the body of knowledge further in one vertical (data analytics).
5131 instances of explicit knowledge (containing over 1.35 million words) are available in the form of tropical discussions, in which the National Hurricane Center explains the reasoning behind its TC forecasts.
Study results were evaluated from both perspectives, meteorological and big data analytics; the application of big data analysis to meteorological data accomplished this.
7. Research Question & Hypothesis
Research Question: Which weather pattern components can improve the Atlantic TC forecast accuracy through the use of the C4.5 algorithm on all five-day tropical discussions from 2001-2015?
The null hypothesis (H0) in this study is non-directional, whereas the alternative hypothesis (Ha) is directional:
• H0: There are no significant differences in the C4.5-algorithm-derived weather pattern components that can decipher the difference between a successful and an unsuccessful TC forecast.
• Ha: There are significant differences in the C4.5-algorithm-derived weather pattern components that can decipher the difference between a successful and an unsuccessful TC forecast.
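To make the classification task behind this hypothesis concrete, the sketch below trains an entropy-criterion decision tree on bag-of-words features of a few made-up discussion snippets labeled successful/unsuccessful. It is only an approximation for illustration: the study itself used WEKA's C4.5 (J48), not scikit-learn, and the texts and labels here are placeholders.

# Illustrative stand-in for the study's C4.5 classification task. The study used
# WEKA's J48; scikit-learn's entropy-criterion tree is only an approximation,
# and the discussion snippets and labels below are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

discussions = [
    "eyewall replacement cycle underway, reconnaissance reports a closed eye",
    "strong shear and dry air entrainment, poorly defined center",
    "well defined eye with concentric eyewalls noted by reconnaissance",
    "broad disorganized circulation embedded in a weak steering flow",
]
outcome = [1, 0, 1, 0]  # 1 = successful forecast, 0 = unsuccessful

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(discussions)             # bag-of-words token features
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, outcome)
print(tree.predict(vectorizer.transform(["reconnaissance finds a contracting eyewall"])))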
9. Methodology
Figure 3. Research design for text mining and this study. The diagram lays out the text mining / predictive data analytics workflow:
• Collecting raw data and integrating data sets
• Preprocessing: data cleaning (removal of HTML tags, common format, addressing missing data) and data preparation (tokenization & word dictionary, stop-word removal, word normalization: stemming & case similarity)
• Model creation: import training data; algorithm & feature selection
• Model prediction: import testing data; assess model
• Interpretation & evaluation: actual performance measurements; review accuracy (true positives, false positives, true negatives, and false negatives); review process; data visualization; determine next steps
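As a rough illustration of the preprocessing stages named in Figure 3, the following minimal Python sketch tokenizes a discussion snippet, folds case, removes stop words, and applies a crude suffix-stripping stemmer. It is an assumption-laden stand-in: the study performed these steps with Microsoft Excel and WEKA, and the stop-word list and stemming rule here are placeholders.

# A minimal preprocessing sketch (not the study's actual tooling): tokenization,
# case normalization, stop-word removal, and naive suffix stemming.
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "is", "are", "in", "on", "with", "as"}

def preprocess(raw_text: str) -> list[str]:
    """Return normalized tokens from one tropical-discussion snippet."""
    text = re.sub(r"<[^>]+>", " ", raw_text)             # strip any HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())          # tokenize + case-fold
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    stemmed = []
    for t in tokens:                                      # crude stemming stand-in
        for suffix in ("ing", "ed", "s"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("The eyewall is weakening as reconnaissance aircraft report falling pressures."))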
10. Instrumentation, Data Collection & Sampling Procedure
Instrumentation:
• Microsoft Visual Studio: screen-scraping tools
• Microsoft Excel: data cleaning, integrating data sets, data preparation, descriptive statistics
• WEKA: C4.5 algorithm (predictive data analytics)
Data Collection:
• Entire population of tropical discussions: 9784 instances with 2.5M words
• Atlantic Ocean basin tropical discussions: 5131 instances with 1.35M words, obtained from the National Hurricane Center
• Tropical verification scores are from the National Hurricane Center
• Total verifiable tropical discussion data sample: 4812 instances with 1.31M words
Sampling Procedure:
• Stratified purposive sampling: 66.66% used for training the C4.5 algorithm and 33.34% used for testing its results
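A minimal sketch of the 66.66%/33.34% split described above, assuming the forecast outcome (successful vs unsuccessful) as the stratification variable; the documents, labels, and variable names are placeholders, not the study's data.

# Hypothetical sketch of a stratified 66.66% / 33.34% train/test split.
from sklearn.model_selection import train_test_split

documents = ["discussion text 1", "discussion text 2", "discussion text 3",
             "discussion text 4", "discussion text 5", "discussion text 6"]
outcome = [1, 0, 1, 0, 1, 0]  # 1 = successful forecast, 0 = unsuccessful (placeholders)

train_docs, test_docs, train_y, test_y = train_test_split(
    documents, outcome,
    train_size=0.6666,   # ~66.66% for training the classifier
    stratify=outcome,    # keep the class balance in both partitions
    random_state=42,
)
print(len(train_docs), len(test_docs))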
11. Findings
• Descriptive analytics on TC data show interesting trends in initial TC intensity versus forecast results, but do not showcase that "two heads are better than one."
• Term document frequency over time shows that token words generally do not change in frequency over time, indicating homogeneous data.
• Information gain identified key tokens that should be studied further.
• Decision tree results show that this study fails to reject the null hypothesis.
12. Descriptive Analytics on TC Data
Figure 4. Descriptive statistics showing the track and intensity classification scores.
• The stronger the initial TC intensity, the better the track forecast (c), and vice versa for intensity forecasts (d).
• Of the 4812 verifiable tropical discussions, approximately 60% (a & b) had better-than-average forecast error.
• There is no significant difference between the number of forecasters and the outcomes of either track or intensity forecasts (e & f).
13. Term Document Frequency over Time
Figure 5. Red-white-green chart of the normalized frequency of certain token words.
The tokenized words and their normalized document frequency per year show that there are no trends in word usage. These tokenized words had to be normalized per year to reduce the influence of highly active Atlantic TC seasons; for instance, 2005 had the most active TC season in recorded history.
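A small sketch of the per-year normalization just described: each token's yearly count is divided by that year's total number of discussions so that very active seasons such as 2005 do not dominate. The counts below are made-up placeholders, not the study's figures.

# Illustrative per-year normalization of token counts (placeholder numbers).
yearly_token_counts = {
    2004: {"eyewall": 120, "shear": 300},
    2005: {"eyewall": 310, "shear": 720},   # very active season, more discussions
}
yearly_discussion_totals = {2004: 400, 2005: 950}

normalized = {
    year: {tok: count / yearly_discussion_totals[year] for tok, count in toks.items()}
    for year, toks in yearly_token_counts.items()
}
print(normalized)  # frequencies become comparable across years of differing activity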
14. Information Gain on Track Forecasts
Table 1. Information gain ranked scores on the track classification scores. (* Highlighted tokens appeared in all three runs.)
15. Information Gain on Intensity Forecasts
Table 2. Information gain ranked scores on the intensity classification scores. (* Highlighted tokens appeared in all three runs.)
16. Information Gain Summary
• Tokens ranked with non-zero information gain across all randomly sampled training data sets:
• TC eye
• reconnaissance
• TC eyewall
• eyewall replacement
• This suggests that gaining a further understanding of these tokens is key to improving overall TC forecasts, and that they warrant more research.
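For readers unfamiliar with the ranking criterion, the sketch below computes information gain for binary token-presence features against a successful/unsuccessful label and sorts the tokens by it. The tokens, labels, and presence vectors are illustrative placeholders; the study's actual computation was done in WEKA.

# Minimal sketch of ranking tokens by information gain against a binary
# forecast outcome (1 = successful, 0 = unsuccessful). Data are placeholders.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(token_present, labels):
    """Entropy reduction from splitting the labels on token presence/absence."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for present, y in zip(token_present, labels) if present == value]
        if subset:
            gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

labels = [1, 1, 0, 0, 1, 0]                      # forecast outcomes
tokens = {"eyewall":        [True, True, False, False, True, False],
          "reconnaissance": [True, False, False, True, True, False]}

ranking = sorted(((information_gain(p, labels), t) for t, p in tokens.items()), reverse=True)
print(ranking)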
17. Decision Tree Summary
Table 3. Descriptive statistics for the randomly sampled C4.5 decision trees for all runs at a 90% confidence interval.
• Classification accuracy meets the 55% threshold value required to be considered a successful classification method.
• The spread between these values is small, supporting the validity of the method.
• The average kappa statistic value is under 0.20, showing slight to no inter-rater agreement.
• This also shows that we cannot reject H0.
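The sketch below shows how the two summary statistics in Table 3 can be derived from a binary confusion matrix: percent correctly classified (compared against the 55% threshold) and Cohen's kappa (values under 0.20 read as slight agreement). The confusion-matrix counts are invented placeholders, not the study's results.

# Sketch of the evaluation summarized above: accuracy versus the 55% threshold,
# and Cohen's kappa for chance-corrected agreement. Counts are placeholders.
def accuracy_and_kappa(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Expected agreement by chance, from the marginal totals
    p_yes = ((tp + fp) / n) * ((tp + fn) / n)
    p_no = ((fn + tn) / n) * ((fp + tn) / n)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return accuracy, kappa

acc, kappa = accuracy_and_kappa(tp=290, fp=230, fn=190, tn=290)
print(f"accuracy={acc:.2%} (threshold 55%), kappa={kappa:.2f} (<0.20 => slight agreement)")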
18. Sample Decision Trees
Figure 6. C4.5 output for the first of three randomly sampled classified track & intensity outcomes (Track Run #1 and Intensity Run #1).
To a first approximation, the TC track is dependent on environmental conditions and steering flow, whereas TC intensity is dependent on the internal dynamics of the storm.
19. Sample Decision Trees
Figure 6. C4.5 output for the first of three randomly sampled classified track & intensity outcomes (Track Run #1 and Intensity Run #1).
Steering was never brought up in the ranked information gain on track forecasts, which helps explain the algorithm's inability (reflected in the low kappa statistic) to correctly decipher which weather components aided in improving the forecasts.
20. Conclusions
• Failed to reject the null hypothesis: there are no significant differences in the C4.5-algorithm-derived weather pattern components that can decipher the difference between a successful and an unsuccessful TC forecast.
• All three gaps in the body of knowledge have been filled.
21. Limitations
• Known limitations:
• the knowledge that was either included in or excluded from the tropical discussion but still used as part of the TC analysis by the hurricane specialist
• analysis of a static 15-year snapshot of TCs in one oceanic basin
• the C4.5 algorithm was the sole predictive analytical algorithm
• Emerging limitations:
• the words used for stemming and tokenization came from term document frequency thresholds of approximately the top 1000 terms during the preprocessing phase
• the binary classification of forecasts, which was initially chosen to aid in generating simple decision trees
• the interactions between track forecast errors and intensity forecast errors could have contributed to the low kappa statistic value
• the training-to-testing data ratio of 66.66% to 33.34% could have been varied in this study to encompass the wide range found in the body of literature (50%-90% of the entire dataset used for training)
22. Implications for Practice
Recommendations for practitioners:
1. Look to other tangential fields to help find new, innovative ways to solve current problems.
2. Analyze the results from all perspectives, which is the best approach for a project that stems from multiple perspectives.
3. Take into account all the different fields of study when combining fields to solve a problem; otherwise, the conclusions are incomplete.
4. Apply predictive analytical processes and techniques to other weather components and phenomena, e.g., tornado forecasting.
5. Prioritize projects on the four tokens (TC eye, eyewall, eyewall replacement, and the reconnaissance program) to yield a higher return on investment.
6. Create a checklist of weather components for analyzing and forecasting TCs, well suited to knowledge sharing, from the 60 tokens/weather components derived from this study.
23. Future Research
1| Data analytics research: More fields need to adopt data analysis in order to deepen the body of knowledge in data analytics.
2| Meteorological research: As an immediate next step, apply the same research question and hypothesis to the remaining oceanic basins: North Eastern Pacific, North Western Pacific, North Indian, South Western Indian, South Eastern Indian, and South Western Pacific.
3| Computer science, data analytics, and meteorological research: Focus on changing the predictive text analytics algorithm; testing a different algorithm against the same dataset should allow a future researcher to obtain different results that could be statistically significant.
4| Data analytics and meteorological research: This study could act as a foundation for predictive text analytics on the TC Reanalysis project. A proposed project could analyze the text reports generated from that project to see what common issues, readjustments, and re-analyses are made to the "best track" data, to help improve first-time quality in future hurricane specialists' tropical discussions.
Globe Image provided for free at https://www.iconfinder.com/icons/285647/globe_icon#size=512
Monitor Image provided for free at https://www.iconfinder.com/icons/473802/business_chart_computer_data_finance_graph_statistics_icon#size=512
Bar chart in donut chart Image provided for free at https://www.iconfinder.com/icons/1312833/analysis_business_data_office_seo_work_icon#size=512
Images of TC Sinlaku from Cira Satellite website: 09/13/2008/0830Z
Gall, R., Franklin, J., Marks, F., Rappaport, E. N., & Toepfer, F. (2013). The hurricane forecast improvement project. Bulletin of the American Meteorological Society, 94(3), 329–343. Doi: http://doi.org/10.1175/BAMS-D-12-00071.1
McAdie, C. J., & Lawrence, M. B. (2000). Improvements in tropical cyclone track forecasting in the Atlantic basin, 1970-98. Bulletin of the American Meteorological Society, 81(5), 989.
Rittel, H. W., & Webber, M. M. (1973). Dilemmas in a general theory of planning. Policy sciences, 4(2), 155-169.
Sheets, R. C. (1990). The National Hurricane Center-past, present, and future. Weather and Forecasting, 5(2), 185-232.
Zhao, K., Lin, Q., Lee, W., Sun, Y. Q., & Zhang, F. (2016). Doppler radar analysis of triple eyewalls in Typhoon Usagi (2013). Bulletin of the American Meteorological Society, 97(1), 25-30. Doi: http://dx.doi.org/10.1175/BAMS-D-15-00029.12
Images of TC Sinlaku from Cira website: 09/08/2008/1230Z, 09/13/2008/0830Z, 09/19/2008/1830Z, and 09/22/2008/1713Z
(Hart & Evans 2001; Jones et al. 2003; Guishard, 2006)
Guishard, M. P. & Evans, J. L. (2008). Atlantic subtropical storms. Part II: Climatology. Journal of Climate, 22, 3574-3594. Retrieved from http://moe.met.fsu.edu/~rhart/papers-hart/2009GuishardEvansHart.pdf
Hart, R. & Evans, J. (2001). A Climatology of the Extratropical Transition of Atlantic Tropical Cyclones. Journal of Climate, 14, 546–564, doi: 10.1175/1520-0442(2001)014<0546:ACOTET>2.0.CO;2.
Jones, S. C., Harr, P. A., Abraham, J., L. Bosart, F., Bowyer, P. J., Evans, J. L., Hanley, D. E., Hanstrum, B. N., Hart, R. E., Lalaurette, F., Sinclair, M. R., Smith, R. K., & Thorncroft, C, (2003).The extratropical transition of tropical cyclones: Forecast challenges, current understanding and future directions. Weather Forecasting, 18, 1052– 1092.
Garcia, A. C. B., Ferraz, I., & Vivacqua, A. S. (2009). From data to knowledge mining. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 23(04), 427-441.
Corrales, D. C., Ledezma, A., & Corrales, J. C. (2015). A conceptual framework for data quality in knowledge discovery tasks (FDQ-KDT): A Proposal. Journal of Computers, V10 (6), 396-405. Doi: 10.17706/jcp.10.6.396-405.
Ahiaga-Dagbui, D. D., & Smith, S. D. (2014). Rethinking construction cost overruns: cognition, learning and estimation. Journal of Financial Management of Property and Construction, 19(1), 38–54. http://doi.org/10.1108/JFMPC-06-2013-0027
Angadi, M. C., & Kulkarni, A. P. (2015). Time series data analysis for stock market prediction using data mining techniques with R. International Journal of Advanced Research in Computer Science, 6(6), 104–108.
Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339. http://doi.org/10.1016/j.eswa.2014.09.028
Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. Advances in Knowledge Discovery and Data Mining, 17(3), 37–54.
Gera, M., & Goel, S. (2015). Data Mining -Techniques, Methods and Algorithms: A Review on Tools and their Validity. International Journal of Computer Applications, 113(18), 22–29.
Hashimi, H., & Hafez, A. (2015). Selection criteria for text mining approaches. Computers in Human Behavior, 51, 729–733. http://doi.org/10.1016/j.chb.2014.10.062
He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33, 464–472. http://doi.org/10.1016/j.ijinfomgt.2013.01.001
Hoonlor, A. (2011). Sequential patterns and temporal patterns for text mining. UMI Dissertation Publishing.
Kim, Y., Jeong, S. R., & Ghani, I. (2014). Text Opinion Mining to Analyze News for Stock Market Prediction. International Journal of Advances in Soft Computing and Its Applications, 6(1), 1–13.
Nassirtoussi, K. A., Aghabozorgi, S., Ying Wah, T., & Ngo, D. C. L. (2014). Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16), 7653–7670. http://doi.org/10.1016/j.eswa.2014.06.009
Mandrai, P., & Barskar, R. (2014). A survey of conceptual data mining and applications. International Journal of Computer Science and Information Security, 11(5), 17–23.
Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1–54. Retrieved from http://epub.wu.ac.at/3978/
Miranda, S. (n.d.). An Introduction to Social Analytics : Concepts and Methods.
Pletscher-frankild, S., Pallejà, A., Tsafou, K., Binder, J. X., & Jensen, L. J. (2015). DISEASES: Text mining and data integration of disease−gene associations. Methods, 74, 83–89. http://doi.org/10.1016/j.ymeth.2014.11.022
Sharma, D. M., Sharma, A. K., & Sharma, S. A. (2012). Using data mining for prediction: A conceptual analysis. Journal on Information Technology, 2(1), 1–9.
Thanh, H. T. P., & Meesad, P. (2014). Stock market trend prediction based on text mining of corporate web and time series data. Journal of Advanced Computational Intelligence and Intelligent Informatics, 18(1), 22–31.
Extratropical Cyclone Michael with Tropical Storm Nadine