Define your Goal
Are you interested in prediction or in inference?
Prediction is a black-box method: given values for the features X1, …, Xp, it
predicts the value of the response Y.
Inference is a white-box method: it explains how the response Y changes as the
features X1, …, Xp change.
People tend to think they need to predict, but more often than not inference will give
them more insight:
In an advertisement campaign, which media contributed most to sales?
Analyzing a business process failure, which attribute of the process contributes the
most to a negative outcome?
Given an increase in height, what is the expected increase in weight?
You must have a goal in mind in the form of a Question to be answered by means of
analyzing the Observations in your data.
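The height and weight question can be sketched in a few lines. This is a minimal illustration, not a full analysis: the observations are made up, and `fit_line` is a hypothetical helper that fits a straight line by ordinary least squares. The same fitted line serves both goals: its slope answers the inference question, and plugging in a new height gives a prediction.

```python
# Hypothetical observations (made up for illustration): height in cm, weight in kg.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 58.0, 67.0, 75.0, 83.0]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(heights, weights)

# Inference: how does the response change when the feature changes?
print(f"Expected weight increase per extra cm: {slope:.2f} kg")

# Prediction: what response do we expect for a new feature value?
predicted = intercept + slope * 175
print(f"Predicted weight at 175 cm: {predicted:.1f} kg")
```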
Define the Model
Looking at the Observations, is the Response present in the data?
In a history of fraudulent transactions, the outcome of fraud or not fraud is specified in the data.
If so, then you are looking at a Supervised model, and there is a Response variable.
Or is the Response not in the data?
In a financial market Exchange, which stocks are hot? The trade transactions do not include
a variable specifying if the stock is hot or not hot!
In this case, you are looking at an Unsupervised model.
Is the Response variable quantitative?
What’s the weight? What’s the price? What’s the income?
You are dealing with a Regression problem.
Or is the Response variable qualitative (categorical)?
Is it fraud? What’s the gender - male or female? What’s the brand - A, B, C?
You are looking into a Classification problem.
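The questions so far form a small decision tree. A sketch of it follows; `problem_type` is a hypothetical helper introduced here only to make the branching explicit.

```python
def problem_type(has_response, response_is_quantitative=None):
    """Map the questions above to a problem type (hypothetical helper)."""
    if not has_response:
        # No Response in the data: nothing to supervise the model with.
        return "unsupervised"
    if response_is_quantitative:
        return "supervised: regression"
    return "supervised: classification"


# Is it fraud? The Response is present and qualitative.
fraud_case = problem_type(has_response=True, response_is_quantitative=False)
# What's the price? The Response is present and quantitative.
price_case = problem_type(has_response=True, response_is_quantitative=True)
# Which stocks are hot? The Response is not in the data.
stock_case = problem_type(has_response=False)

print(fraud_case)
print(price_case)
print(stock_case)
```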
Is there a somewhat linear relationship between the features and the response?
Gas consumption as a function of horsepower.
Fit a Linear model to your Observations.
Is there no clear relationship or form between the features and the response?
Gas consumption as a function of the model year of the car.
Prefer a non-parametric method, such as Regression Splines and Generalized Additive Models.
Is the Response made of only two categories (e.g. yes/no)?
Fit a Logistic regression model to your Observations.
Is there a somewhat linear boundary between the categories of the Response?
Use Linear Discriminant Analysis.
Is there no clear form to the boundary between the categories, but the probability distribution of the categories is known?
Use a Naive Bayes Classifier.
Otherwise, if there is no clear boundary and the distribution is not known:
Use K-Nearest Neighbors.
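To make the last option concrete, here is a minimal K-Nearest Neighbors sketch in plain Python. The transactions and their two features are made up, `knn_predict` is a hypothetical helper, and a real project would more likely use a library implementation and scale the features first.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical transactions: (amount in thousands, fraction of the day elapsed).
points = [(0.1, 0.5), (0.2, 0.6), (0.15, 0.4), (5.0, 0.1), (6.0, 0.05), (5.5, 0.15)]
labels = ["not fraud", "not fraud", "not fraud", "fraud", "fraud", "fraud"]

large_night_transaction = knn_predict(points, labels, query=(5.2, 0.1))
small_day_transaction = knn_predict(points, labels, query=(0.3, 0.5))
print(large_night_transaction)
print(small_day_transaction)
```

No boundary shape and no distribution is assumed: the prediction depends only on which Observations are closest to the query.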
Unsupervised learning is a relatively new field
Is there a desired number of groups or categories?
Hot stocks (financial derivatives) and Not-so-Hot
Otherwise, if the number of groups is not known:
Stocks A and B trend together, stocks C and D trend together, stocks E and F…
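The slides do not name a method for either case. As an assumption, one common choice when the number of groups is fixed in advance is k-means clustering; when it is not known, hierarchical clustering is a common alternative. The sketch below covers only the first case, with made-up stock features and a hypothetical `kmeans` helper.

```python
import math


def kmeans(points, k, iterations=20):
    """Minimal k-means: returns (cluster index per point, cluster centers)."""
    # Deterministic start (a simplification for this sketch): the first point,
    # then repeatedly the point farthest from the centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )

    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: the centers stopped moving
        centers = new_centers
    return assignment, centers


# Hypothetical stocks: (weekly price change %, weekly volume change %).
stocks = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
groups, centers = kmeans(stocks, k=2)
print(groups)
```

Note that no stock is labelled hot or not hot in the input; the two groups come only from how the Observations sit relative to each other.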
Assessing the Model
The model is created by fitting the Observations.
The Accuracy of the model must be assessed:
If a regression problem, then measure the mean squared error.
If a classification problem, then measure the error rate.
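Both measures are short to compute. A minimal sketch with made-up actual and predicted values; the function names are illustrative.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_rate(actual, predicted):
    """Fraction of observations whose predicted category is wrong."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


# Regression: hypothetical actual and predicted weights.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.5, 5.0, 8.0])

# Classification: hypothetical actual and predicted fraud labels.
rate = error_rate(
    ["fraud", "not fraud", "not fraud", "fraud"],
    ["fraud", "not fraud", "fraud", "fraud"],
)

print(mse)
print(rate)
```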
Now that we can measure accuracy, we can try different methods to assess and improve the model:
Leave k Observations out as test data and Cross-Validate.
Bootstrap by resampling the Observations with replacement.
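Both resampling ideas can be sketched on a simple linear fit. Everything here is illustrative: the data are made up, the helper names are hypothetical, and the folds are taken as every k-th Observation, whereas real data would usually be shuffled first.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_mse(x, y, k=5):
    """Average test MSE over k folds, each fold held out exactly once."""
    n = len(x)
    fold_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th observation is held out
        train_x = [x[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        intercept, slope = fit_line(train_x, train_y)
        squared = [(y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / k


def bootstrap_slopes(x, y, rounds=200, seed=0):
    """Refit the line on samples drawn with replacement; return the slopes."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(rounds):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # a line cannot be fitted to a single distinct x value
        slopes.append(fit_line(xs, [y[i] for i in idx])[1])
    return slopes


# Hypothetical observations: height in cm, weight in kg.
heights = [150, 155, 160, 165, 170, 175, 180, 185, 190, 195]
weights = [51, 56, 58, 63, 66, 72, 74, 80, 83, 89]

cv_mse = cross_validated_mse(heights, weights, k=5)
slopes = bootstrap_slopes(heights, weights)
print(f"Cross-validated MSE: {cv_mse:.2f}")
print(f"Bootstrap slope range: {min(slopes):.2f} to {max(slopes):.2f}")
```

Cross-validation estimates the error on Observations the model has not seen; the spread of the bootstrap slopes indicates how much the fitted coefficient would move if the training data changed.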
Improving the Model
The possible findings are:
Change the features used in the Model:
Car color has no correlation with gas consumption, thus remove it from the Model.
Change the interaction between the features:
Horsepower to gas consumption is not strictly linear, thus square the horsepower variable.
Change the model:
Low accuracy is a good indication that the selected Model is wrong.
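The horsepower example can be checked by fitting the same linear model twice, once on horsepower and once on its square, and comparing the mean squared error. This is a sketch under strong assumptions: the cars are made up, and their consumption was constructed to grow with the square of horsepower, so the squared feature fits almost perfectly by design.

```python
def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def training_mse(x, y):
    """MSE of the fitted line on the same observations it was fitted to."""
    intercept, slope = fit_line(x, y)
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)


# Hypothetical cars: horsepower and gas consumption in liters per 100 km.
horsepower = [60, 80, 100, 120, 150, 200]
consumption = [4.72, 5.28, 6.0, 6.88, 8.5, 12.0]

# Same model form, different feature: horsepower versus horsepower squared.
mse_linear = training_mse(horsepower, consumption)
mse_squared = training_mse([hp ** 2 for hp in horsepower], consumption)

print(f"MSE with horsepower:         {mse_linear:.4f}")
print(f"MSE with horsepower squared: {mse_squared:.4f}")
```

On real data the comparison should use held-out Observations rather than the training error shown here.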
Flexibility versus Interpretability, Bias versus Variance
Models that tend to have high accuracy are hard to interpret and therefore less suitable for inference:
Linear regression is easy to interpret, but tends to have lower accuracy.
Support Vector Machines are very flexible, but can’t be easily interpreted.
Models that tend to be flexible are less biased, but are more sensitive to variation in the training data:
Linear regression is biased towards a linear form, but copes well with variation in the training data.
k-NN has low bias, but has high variance as the training data changes.
“In God we trust, all others bring data.”
“All models are wrong, some are useful.”
“We are drowning in information and starving for knowledge.”