5. What’s Weather Prediction?
• Weather forecasting (or weather prediction) looks at models built from past observations of weather and predicts the likelihood of future weather.
[Diagram: past observations (temperature, precipitation, wind speed) feed a model for pattern extraction; the extracted rules (e.g. temp < x & wind speed > y → no rain; temp < x & wind < y & precipitation < z → rain) are applied to current values, plus whether it has rained in the past, to make a prediction.]
Machine learning works out:
- What features, and which groups of them?
- What thresholds?
- What correlations?
Canonical example for data analysis/machine learning.
Observations can be at various granularities. There are many ways to get weather data, from both commercial entities and national bodies.
https://weatherspark.com/h/m/145212/2018/11/Historical-Weather-in-November-2018-at-San-Francisco-International-Airport-California-United-States
Quick snapshot of the observations about weather data.
Understand the domain of the problem, data characteristics
What are we trying to predict – precipitation?
Focus in this workshop is to arrive at a model derived from data analysis rather than physics of the atmosphere etc.
Data
Set of values about a subject that describe it qualitatively or quantitatively. Features are the various components of the data.
Data science
Take data – understand it, process it, extract value from it…then communicate or act on the derived value
Pattern recognition
Auto-discovery of regularities in data. Once discovered, take action. E.g. classify data into categories.
Is it humanly possible to infer all patterns in the data? This is where algorithmic techniques come in
Walk through the Excel sheet. A look at the various variables; maybe filter and show the weathertype column. Are the classes clearly seen?
What to do when you see empty columns? Do we discard them or impute an equivalent value?
- How do you deal with outliers?
Quick look at the data. See how the variables vary.
Need for pre-processing and cleaning the data
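As a minimal sketch of the cleaning questions above (discard vs. impute, clipping outliers) — column names and values here are illustrative, not from the actual dataset:

```python
# Minimal sketch of cleaning a small weather table with pandas.
import pandas as pd

df = pd.DataFrame({
    "temp":   [14.0, 15.5, None, 99.0, 13.2],   # 99.0 is an implausible outlier
    "precip": [0.0, 1.2, 0.4, None, 0.0],
})

# Option 1: discard rows with missing values.
dropped = df.dropna()

# Option 2: impute an equivalent value (here, the column median).
imputed = df.fillna(df.median(numeric_only=True))

# Outliers: e.g. clip temperatures to a plausible range.
imputed["temp"] = imputed["temp"].clip(lower=-30, upper=50)

print(dropped.shape)
print(imputed["temp"].max())
```

Which option is right depends on how much data you have and whether the missing values are random or systematic.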
Understand the relationship between the multiple variables and attributes in your dataset.
If your dataset has perfectly positively or negatively correlated attributes, there is a high chance the model's performance will be impacted by a problem called "multicollinearity". Multicollinearity happens when one predictor variable in a multiple regression model can be linearly predicted from the others with a high degree of accuracy. This can lead to skewed or misleading results.
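One quick way to spot such pairs is a correlation matrix. A sketch with made-up column names (e.g. the same temperature recorded in both Celsius and Fahrenheit is a perfectly linear pair):

```python
# Spotting multicollinearity candidates via a pandas correlation matrix.
import pandas as pd

df = pd.DataFrame({
    "temp_c": [10, 12, 15, 20, 22],
    "temp_f": [50.0, 53.6, 59.0, 68.0, 71.6],  # exactly temp_c * 1.8 + 32
    "wind":   [5, 3, 8, 2, 7],
})

corr = df.corr()
# temp_c and temp_f are perfectly correlated -> drop one of them
# before training a linear or logistic regression model.
print(corr.loc["temp_c", "temp_f"])
```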
Having seen the weather data, what kind of patterns do you see?
Talk of supervised vs unsupervised.
Where would this problem fit? Why?
Talk of classification.
Talk of decision tree
- Decision tree and boosted tree algorithms are by nature immune to multicollinearity:
when deciding a split, the tree will choose only one of the perfectly correlated features. Other algorithms, like logistic regression or linear regression, are not immune to the problem, and you should fix it before training the model.
The J48 decision tree is the WEKA project team's implementation of the C4.5 algorithm, a successor to ID3 (Iterative Dichotomiser 3). R includes this nice work via the RWeka package.
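The workshop uses J48 via WEKA/RWeka; as a rough Python analogue (a different implementation, not the one above), scikit-learn's DecisionTreeClassifier sketches the same idea on toy, made-up weather data:

```python
# Illustrative decision tree on toy data (not the workshop's J48/RWeka setup).
from sklearn.tree import DecisionTreeClassifier

# Toy features: [temp, wind_speed, precipitation]; labels: 1 = rain, 0 = no rain.
X = [[12, 2, 5], [25, 8, 0], [10, 1, 7], [28, 9, 0], [11, 3, 6], [26, 7, 0]]
y = [1, 0, 1, 0, 1, 0]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Cool, calm, damp day -> the tree classifies it as rain.
print(clf.predict([[13, 2, 4]]))
```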
Go through DT, look at rules, see the accuracy, TP etc
TP – true positive – correctly predicted a positive (e.g. cat)
TN – true negative – correctly predicted a negative (not a cat)
FP – incorrectly predicted a positive – (dog as cat)
FN – incorrectly predicted a negative (cat predicted as not cat)
Precision: What proportion of positive identifications was actually correct?
P = TP / (TP + FP)
Recall: What proportion of actual positives was identified correctly?
R = TP / (TP + FN)
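Worked numbers for the two formulas above (the counts are illustrative):

```python
# Precision and recall from illustrative confusion-matrix counts.
tp, fp, fn = 40, 10, 20

precision = tp / (tp + fp)  # 40 / 50 = 0.8
recall = tp / (tp + fn)     # 40 / 60 ~= 0.667

print(precision, recall)
```

Note the trade-off: lowering the classifier's threshold tends to raise recall (fewer FN) while lowering precision (more FP).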
Techcon ML Challenge – sets the context for the talk
https://hpe.sharepoint.com/sites/F5/CTO/Office/tech-con/Pages/2020-tech-con-challenge.aspx
Challenge:
Given relatively limited historical weather reports such as those available for The San Francisco International Airport up to a certain day, predict whether it will rain on the next day at that location.
From a paper on weather modelling:
“Making inferences and predictions about weather has been an omnipresent challenge throughout human history. Challenges with accurate meteorological modeling brings to the fore difficulties with reasoning about the complex dynamics of Earth's atmospheric system.”
How many attributes to pick? Do all of them matter?
With more features, data density reduces and it is easier to find hyperplanes that separate the classes. But this could result in overfitting.
A technique for dimensionality reduction is feature extraction
Pre-processing: standardization, normalization, discretization, signal enhancement, extraction of local features, etc.
All 3 below are part of feature selection
2. feature subset generation (search strategy)
3. evaluation criterion definition (relevance/predictive power)
4. evaluation criterion estimation (assessment method)
https://towardsdatascience.com/why-how-and-when-to-apply-feature-selection-e9c69adfabf2
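One concrete way to sketch the "evaluation criterion" steps is univariate feature selection with scikit-learn's SelectKBest; the synthetic data below (one informative column, three noise columns) is purely illustrative:

```python
# Feature selection sketch: score each feature against the label,
# keep the k best (here k = 1).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)
informative = y + rng.normal(scale=0.1, size=100)  # strongly tied to the label
noise = rng.normal(size=(100, 3))                  # irrelevant features
X = np.column_stack([informative, noise])

selector = SelectKBest(score_func=f_classif, k=1)
selector.fit(X, y)
print(selector.get_support())  # only the informative column is kept
```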