A Guideline to Statistical and Machine Learning


- 1. A Guideline for Statistical and Machine Learning. Alexandre Alves, June 12, 2014
- 2. Define your Goal
- 3. Define your Goal Are you interested in predicting from your data, or in inferring from it? Prediction is a black-box method: given values for the features X1, …, Xp, it predicts the value of the response Y. Inference is a white-box method: it explains how the response Y is affected as the features X1, …, Xp change.
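To make the distinction concrete, here is a minimal sketch in plain NumPy (the data and coefficients are made up for illustration): after a least-squares fit, reading the coefficients is inference, while plugging new feature values into the fitted model is prediction.

```python
import numpy as np

# Toy data: Y depends on two features X1, X2 (coefficients are invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Fit ordinary least squares with an intercept column.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Inference (white box): how does Y move as each feature changes?
print("intercept and slopes:", coef.round(2))

# Prediction (black box): plug in new feature values and read off Y.
x_new = np.array([1.0, 1.0, -1.0])  # [intercept term, X1, X2]
print("predicted Y:", round(float(x_new @ coef), 2))
```

The same fitted object answers both questions; the goal decides which output you look at.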
- 4. Define your Goal People tend to think they need to predict, but more often than not inference gives them more insight: in an advertising campaign, which medium contributed most to sales? When analyzing a business process failure, which attribute of the process contributes most to a negative outcome? Given an increase in height, what is the expected increase in weight? You must have a goal in mind, in the form of a Question to be answered by analyzing the Observations in your data.
- 5. Define the Model
- 6. Define the Model Looking at the Observations, is the Response present in the data? In a history of fraudulent transactions, the outcome (fraud or not fraud) is specified in the transactions themselves. If so, you are looking at a Supervised model, and there is a Response variable. Or is the Response absent from the data? In a financial exchange, which stocks are hot? The trade transactions do not include a variable specifying whether the stock is hot! In that case, you are looking at an Unsupervised model.
- 7. Supervised Models Is the Response variable quantitative? What's the weight? What's the price? What's the income? You are dealing with a Regression problem. Or is the Response variable qualitative (categorical)? Is it fraud? What's the gender: male or female? What's the brand: A, B, or C? You are looking at a Classification problem.
- 8. Regression Problems Is there a somewhat linear relationship between the features and the response? Gas consumption versus horsepower. Fit a Linear model to your Observations. Is there no clear relationship or form between the features and the response? Gas consumption versus year of the car model. Prefer a non-parametric method, such as Regression Splines or Generalized Additive Models.
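A quick sketch of the choice, on invented data, in plain NumPy: when the relationship is curved, the straight-line fit leaves a visibly larger error. A low-degree polynomial via `np.polyfit` stands in here for a truly flexible method such as a regression spline.

```python
import numpy as np

rng = np.random.default_rng(1)
horsepower = rng.uniform(50, 250, size=200)
# Hypothetical curved relationship between horsepower and consumption.
consumption = 4.0 + 0.0003 * horsepower**2 + rng.normal(scale=1.0, size=200)

def training_mse(coeffs):
    # Mean squared error of a polynomial fit on the training data.
    return float(np.mean((np.polyval(coeffs, horsepower) - consumption) ** 2))

linear = np.polyfit(horsepower, consumption, deg=1)  # straight line
curved = np.polyfit(horsepower, consumption, deg=2)  # flexible stand-in

print("linear MSE:   ", round(training_mse(linear), 2))
print("quadratic MSE:", round(training_mse(curved), 2))
```

If the two errors were close, the linear model would be preferred for its interpretability.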
- 9. Classification Problems Is the Response made of only two categories (e.g. yes/no)? Fit a Logistic regression model to your Observations. Is there a somewhat linear boundary between the categories of the Response? Use Linear Discriminant Analysis. Is there no clear boundary, but the probability distribution of the categories is known? Use a Naive Bayes Classifier. Otherwise, if there is no clear boundary and the distribution is not known, use K-Nearest Neighbors.
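Of these methods, K-Nearest Neighbors is simple enough to sketch directly; a minimal NumPy version on made-up two-cluster data (real work would reach for a library implementation such as scikit-learn's):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    # Majority vote among the k training points closest to x.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest_labels, return_counts=True)
    return int(values[np.argmax(counts)])

# Two well-separated classes (invented data).
rng = np.random.default_rng(2)
X_train = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(20, 2)),  # class 0
    rng.normal(loc=[3.0, 3.0], scale=0.5, size=(20, 2)),  # class 1
])
y_train = np.array([0] * 20 + [1] * 20)

print(knn_predict(X_train, y_train, np.array([0.2, -0.1])))  # near class 0
print(knn_predict(X_train, y_train, np.array([2.8, 3.1])))   # near class 1
```

Note that k-NN makes no assumption about the boundary's shape or the categories' distribution, which is exactly why it is the fallback here.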
- 10. Unsupervised Models Unsupervised learning is a relatively new field. Is there a desired number of groups or categories? Hot stocks (financial derivatives) versus not-so-hot: use K-Means Clustering. Otherwise, if the number of groups is not known (stocks A and B trend together, stocks C and D trend together, stocks E and F…): use Hierarchical Clustering.
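K-Means fits in a few lines of NumPy; a sketch of Lloyd's algorithm on an invented two-group dataset. A deterministic farthest-first initialization is used here to keep the toy example stable (library versions typically use random or k-means++ initialization).

```python
import numpy as np

def kmeans(X, k, iters=50):
    # Farthest-first initialization: start at the first point, then
    # repeatedly pick the point farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each point to its nearest center, then move each center
        # to the mean of its assigned points.
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two invented groups of daily stock returns: "hot" and "not-so-hot".
rng = np.random.default_rng(3)
hot = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(30, 2))
cold = rng.normal(loc=[-1.0, -1.0], scale=0.3, size=(30, 2))
labels, centers = kmeans(np.vstack([hot, cold]), k=2)
print(centers.round(1))
```

The desired number of groups (k=2 here) must be supplied up front, which is precisely the branching condition on this slide.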
- 11. Train (and Re-train) the Model
- 12. Assessing the Model The model is created by fitting the Observations. The Accuracy of the model must be assessed: for a regression problem, measure the mean squared error; for a classification problem, measure the error rate. Being able to measure, we can now try different methods to improve the model: hold observations out and cross-validate (leave-k-out), or bootstrap by resampling.
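Both resampling ideas take only a few lines of NumPy. A sketch on invented data, with a straight-line fit as the model being assessed; k-fold cross-validation is used here as the usual practical form of leaving observations out.

```python
import numpy as np

def cv_mse(x, y, n_folds=5):
    # k-fold cross-validation: hold each fold out, fit a line on the rest,
    # and average the mean squared errors on the held-out folds.
    folds = np.array_split(np.arange(len(x)), n_folds)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(len(x)), fold)
        coeffs = np.polyfit(x[train], y[train], deg=1)
        errors.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=100)
print("cross-validated MSE:", round(cv_mse(x, y), 3))

# Bootstrap: resample observations with replacement to gauge how much the
# fitted slope varies from sample to sample.
slopes = [np.polyfit(x[b], y[b], deg=1)[0]
          for b in (rng.choice(len(x), size=len(x), replace=True)
                    for _ in range(200))]
print("slope std across bootstrap samples:", round(float(np.std(slopes)), 3))
```

Cross-validation estimates out-of-sample error; the bootstrap estimates the variability of the fitted parameters themselves.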
- 13. Improving the Model The possible findings are: change the features used in the Model (car color has no correlation with gas consumption, so remove it from the Model); change the interaction between the features (horsepower to gas consumption is not strictly linear, so square the horsepower variable); change the model itself (low accuracy is a good indication that the selected Model is wrong).
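The squared-horsepower finding can be sketched directly in NumPy (numbers invented): adding the squared term as a column of the design matrix drops the training error when the true relationship is curved.

```python
import numpy as np

rng = np.random.default_rng(5)
hp = rng.uniform(50, 250, size=150)
consumption = 5.0 + 0.0002 * hp**2 + rng.normal(scale=0.3, size=150)

def fit_mse(*features):
    # Least-squares fit using an intercept plus the given feature columns.
    A = np.column_stack((np.ones(len(hp)),) + features)
    coef, *_ = np.linalg.lstsq(A, consumption, rcond=None)
    return float(np.mean((A @ coef - consumption) ** 2))

print("horsepower only:  ", round(fit_mse(hp), 3))
print("plus squared term:", round(fit_mse(hp, hp**2), 3))
```

The same mechanism covers the other findings: dropping an uncorrelated column, or swapping the whole design for a different model family.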
- 14. Trade-offs Models that tend to have high accuracy are hard to interpret and therefore inappropriate for inference: Linear regressions are easy to interpret but have lower accuracy; Support Vector Machines are very flexible but cannot be easily interpreted. Models that are flexible are less biased, but don't cope well with variance in the training data: Linear regressions are biased towards a linear form but cope well with variance in the training data; k-NN has little bias but high variance as the training data changes. Flexibility versus Interpretability, Bias versus Variance.
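The variance half of the trade-off shows up in a small experiment (NumPy, invented data): fit a straight line and a 1-nearest-neighbor rule on two independent samples from the same process, and compare how much each model's predictions move between the two fits.

```python
import numpy as np

def sample(rng, n=50):
    # Two independent draws from the same linear process with noise.
    x = rng.uniform(0, 10, size=n)
    return x, 0.5 * x + rng.normal(scale=0.3, size=n)

def line_pred(x, y, grid):
    return np.polyval(np.polyfit(x, y, deg=1), grid)

def one_nn_pred(x, y, grid):
    # 1-NN regression: predict the y of the single closest training x.
    return np.array([y[np.argmin(np.abs(x - g))] for g in grid])

rng = np.random.default_rng(6)
grid = np.linspace(1, 9, 20)
x1, y1 = sample(rng)
x2, y2 = sample(rng)

# Mean squared change in predictions across the two training sets.
line_var = float(np.mean((line_pred(x1, y1, grid) - line_pred(x2, y2, grid)) ** 2))
knn_var = float(np.mean((one_nn_pred(x1, y1, grid) - one_nn_pred(x2, y2, grid)) ** 2))
print("line variance:", round(line_var, 3), " 1-NN variance:", round(knn_var, 3))
```

The rigid line barely moves between samples, while 1-NN chases every noisy point: low bias, high variance.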
- 15. "In God we trust, all others bring data." –William Deming "All models are wrong, some are useful." –George Box "We are drowning in information and starving for knowledge." –Rutherford Roger
