A Guideline to Statistical and Machine Learning



Published in: Technology, Education

  1. 1. A Guideline for Statistical and Machine Learning Alexandre Alves, June/12/2014
  2. 2. Define your Goal
  3. 3. Define your Goal Are you interested in prediction or in inference? Prediction is a black-box method: given values for the features X1, …, Xp, it predicts the value of the response Y. Inference is a white-box method: it explains how the response Y is affected as the features X1, …, Xp change.
  4. 4. Define your Goal People tend to think they need to predict, but more often than not inference will give them more insight: In an advertising campaign, which media contributed most to sales? When analyzing a business process failure, which attribute of the process contributes the most to a negative outcome? Given an increase in height, what is the expected increase in weight? You must have a goal in mind in the form of a Question to be answered by analyzing the Observations in your data.
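
The height and weight question can be sketched with a one-feature least-squares fit. This is a minimal illustration, not part of the original slides: the function name `fit_line` and the observations are made up. The slope answers the inference question, while plugging in a new height answers the prediction question.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up observations (assumption): height in cm, weight in kg.
heights = [150, 160, 170, 180, 190]
weights = [52, 58, 66, 74, 80]

intercept, slope = fit_line(heights, weights)

# Inference: how does the response change as the feature changes?
print(f"Each extra cm of height adds about {slope:.2f} kg")

# Prediction: what response do we expect for a new feature value?
predicted = intercept + slope * 175
print(f"Predicted weight at 175 cm: {predicted:.1f} kg")
```

The same fitted model serves both goals; what differs is whether you read the coefficient or the output.
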
  5. 5. Define the Model
  6. 6. Define the Model Looking at the Observations, is the Response present in the data? In a history of fraudulent transactions, the outcome of fraud or not fraud is specified in the transactions themselves. If so, then you are looking at a Supervised model, and there is a Response variable. Or is the Response not in the data? In a financial market Exchange, which stocks are hot? The trade transactions do not include a variable specifying if the stock is hot or not hot! In this case, you are looking at an Unsupervised model.
  7. 7. Supervised Models Is the Response variable quantitative? What’s the weight? What’s the price? What’s the income? You are dealing with a Regression problem. Or is the Response variable qualitative (categorical)? Is it fraud? What’s the gender - male or female? What’s the brand - A, B, C? You are looking into a Classification problem.
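
The questions on the last two slides amount to a small decision rule. The sketch below is only an illustration: `problem_type` is a hypothetical helper, and it assumes categories are stored as non-numeric values.

```python
from numbers import Number


def problem_type(response):
    """Name the kind of problem from the response column.

    `response` is a list of observed response values, or None when the
    data carries no response at all.
    """
    if response is None:
        return "unsupervised"
    # Booleans are treated as categories, not quantities.
    if all(isinstance(v, Number) and not isinstance(v, bool) for v in response):
        return "regression"
    # Assumption: categories are not encoded as numbers (e.g. 0/1 codes);
    # numerically coded categories would be misread as quantitative.
    return "classification"


print(problem_type([70.5, 82.0, 64.3]))          # weights
print(problem_type(["fraud", "not fraud"]))      # fraud labels
print(problem_type(None))                        # trades with no "hot" flag
```
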
  8. 8. Regression Problems Is there a somewhat linear relationship between the features and the response? For example, gas consumption versus horsepower. Fit a Linear model to your Observations. Is there no clear relationship or form between the features and the response? For example, gas consumption versus year of the car model. Prefer a non-parametric method, such as Regression Splines and Generalized Additive Models.
  9. 9. Classification Problems Is the Response made of only two categories (e.g. yes/no)? Fit a Logistic regression model to your Observations. Is there a somewhat linear boundary between the categories of the Response? Use Linear Discriminant Analysis. Is there no clear form for the boundary between the categories, but the probability distribution of the features within each category is known? Use a Naive Bayes Classifier, which additionally assumes that the features are independent within each category. Otherwise, if there is no clear boundary and the distribution is not known: Use K-Nearest Neighbors.
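
Of the methods on this slide, K-Nearest Neighbors is the simplest to sketch without a library. The version below is a minimal illustration under stated assumptions: the name `knn_predict`, the two features, and the labelled points are all made up, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # Sort training observations by Euclidean distance to the query point.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up transactions (assumption): two numeric features per observation.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
labels = ["not fraud", "not fraud", "not fraud", "fraud", "fraud", "fraud"]

first = knn_predict(points, labels, (1.2, 1.4))
second = knn_predict(points, labels, (6.4, 6.2))
print(first, "/", second)
```

No boundary shape and no distribution is assumed here: the training data itself is the model, which is why the method fits the "otherwise" case.
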
  10. 10. Unsupervised Models Unsupervised learning is a relatively new field. Is there a desired number of groups or categories? For example, Hot stocks (financial derivatives) and Not-so-Hot stocks. Use K-Means Clustering. Otherwise, if the number of groups is not known: Stocks A and B trend together, stocks C and D trend together, stocks E and F… Use Hierarchical Clustering.
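
When the number of groups is fixed in advance, K-Means alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below is an illustration only: `k_means` is a hypothetical name, the stock figures are made up, and the centroids start at the first k points so the run is reproducible (practical implementations usually try several random starts).

```python
import math


def k_means(points, k, iterations=20):
    """Plain K-Means on tuples of numbers; returns (centroids, assignment)."""
    # Assumption for reproducibility: start from the first k points.
    centroids = [points[i] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Made-up stocks (assumption): (price change %, volume change %) per stock.
stocks = [(0.1, 0.2), (2.0, 2.1), (0.2, 0.1), (2.2, 1.9), (0.15, 0.15), (2.1, 2.0)]
centroids, groups = k_means(stocks, k=2)
print(groups)
print(centroids)
```

Note that no observation carries a "hot" label; the two groups emerge from the features alone.
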
  11. 11. Train, (and Re-train) the Model
  12. 12. Assessing the Model The model is created by fitting it to the Observations. The Accuracy of the model must then be assessed: for a regression problem, measure the mean squared error; for a classification problem, measure the error rate. With a measure in hand, we can estimate how the model performs on unseen data and compare alternative models: hold out k Observations from the training data and Cross-Validate, or Bootstrap by resampling the Observations with replacement.
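
The two error measures and the two resampling schemes can be sketched in a few lines. Everything named here is hypothetical (`k_fold_mse`, `bootstrap_sample`, and the mean-predicting baseline model), the data is made up, and the folds are contiguous slices, which assumes the observations are in no particular order and that there are at least as many observations as folds.

```python
import random


def mean_squared_error(actual, predicted):
    """Average squared difference, for regression problems."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def error_rate(actual, predicted):
    """Fraction of misclassified observations, for classification problems."""
    return sum(a != p for a, p in zip(actual, predicted)) / len(actual)


def k_fold_mse(xs, ys, fit, predict, folds=5):
    """Average held-out MSE over `folds` contiguous folds."""
    n = len(xs)
    scores = []
    for f in range(folds):
        start, stop = f * n // folds, (f + 1) * n // folds
        # Train on everything outside the fold, score on the fold itself.
        model = fit(xs[:start] + xs[stop:], ys[:start] + ys[stop:])
        held_out = [predict(model, x) for x in xs[start:stop]]
        scores.append(mean_squared_error(ys[start:stop], held_out))
    return sum(scores) / folds


def bootstrap_sample(xs, ys, rng):
    """Resample the observations with replacement, keeping pairs together."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]


# Hypothetical baseline model: always predict the mean training response.
def fit_mean(train_x, train_y):
    return sum(train_y) / len(train_y)


def predict_mean(model, x):
    return model


xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
print("5-fold CV MSE:", k_fold_mse(xs, ys, fit_mean, predict_mean))
boot_x, boot_y = bootstrap_sample(xs, ys, random.Random(0))
print("Bootstrap sample size:", len(boot_x))
```

Passing `fit` and `predict` as arguments lets the same loop score any candidate model, which is what makes the comparison between models fair.
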
  13. 13. Improving the Model The possible findings are: Change the features used in the Model: car color has no correlation with gas consumption, so remove it from the Model. Change how a feature enters the Model: gas consumption is not strictly linear in horsepower, so square the horsepower variable. Change the Model: low accuracy is a good indication that the selected Model is wrong.
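
The horsepower example can be checked numerically by fitting the same linear model twice, once on horsepower and once on its square, and comparing the mean squared error. This is a sketch on made-up data constructed to be exactly quadratic; `fit_line` and `mse_of_fit` are hypothetical helpers.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse_of_fit(xs, ys):
    """Fit a line to (xs, ys) and return its mean squared error on that data."""
    intercept, slope = fit_line(xs, ys)
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data (assumption): consumption = 1 + 0.5 * horsepower**2,
# with horsepower in arbitrary scaled units.
horsepower = [1.0, 2.0, 3.0, 4.0, 5.0]
consumption = [1.5, 3.0, 5.5, 9.0, 13.5]

linear_mse = mse_of_fit(horsepower, consumption)
squared_mse = mse_of_fit([h ** 2 for h in horsepower], consumption)
print(f"MSE with horsepower:         {linear_mse:.3f}")
print(f"MSE with horsepower squared: {squared_mse:.3f}")
```

The model is still linear in its coefficients; only the feature was transformed. Note that both errors are measured on the training data here, so a real comparison should use held-out data.
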
  14. 14. Trade-offs Flexibility versus Interpretability: models that tend to have high accuracy are hard to interpret and are therefore less appropriate for inference. Linear regressions are easy to interpret, but tend to be less accurate when the true relationship is not linear. Support Vector Machines are very flexible, but cannot be easily interpreted. Bias versus Variance: models that tend to be flexible are less biased, but are more sensitive to changes in the training data. Linear regressions are biased towards a linear form, but change little as the training data changes. k-NN with a small k has low bias, but high variance as the training data changes.
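
The Bias versus Variance trade-off has a standard algebraic form, added here for reference. The symbols are assumptions that do not appear in the slides: $\hat{f}$ is the fitted model, $(x_0, y_0)$ is a new observation, and $\varepsilon$ is the irreducible noise in the response.

```latex
\mathbb{E}\!\left[\bigl(y_0 - \hat{f}(x_0)\bigr)^2\right]
  = \operatorname{Var}\!\bigl(\hat{f}(x_0)\bigr)
  + \bigl[\operatorname{Bias}\bigl(\hat{f}(x_0)\bigr)\bigr]^2
  + \operatorname{Var}(\varepsilon)
```

A more flexible model usually lowers the bias term and raises the variance term; the expected test error is smallest where their sum is smallest.
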
  15. 15. “In God we trust, all others bring data.” –W. Edwards Deming. “All models are wrong, some are useful.” –George Box. “We are drowning in information and starving for knowledge.” –Rutherford D. Rogers.