
Published on

Valencian Summer School in Machine Learning 2016

Day 1 VSSML16

Lecture 2

Ensembles and Logistic Regression

Poul Petersen (BigML)

https://bigml.com/events/valencian-summer-school-in-machine-learning-2016

Published in:
Data & Analytics


- 1. September 8-9, 2016
- 2. Ensembles. Poul Petersen, CIO, BigML, Inc. Making trees unstoppable.
- 3. Ensemble Idea: Rather than build a single model, combine the output of several "weaker" models into a powerful ensemble. Q1: Why would this work? Q2: How do we build "weaker" models? Q3: How do we "combine" models?
- 4. Why Ensembles: 1. Every "model" is an approximation of the "real" function, and there may be several good approximations. 2. ML algorithms use random processes to solve NP-hard problems and may arrive at different "models" depending on starting conditions, local optima, etc. 3. A given ML algorithm may not be able to exactly "model" the real characteristics of a particular dataset. 4. Anomalies in the data may cause over-fitting, that is, trying to model behavior that should be ignored; by using several models, the outliers may be averaged out. In any case, if we find several accurate "models", the combination may be closer to the real "model".
- 5. Ensemble Demo #1
- 6. Weaker Models: 1. Bootstrap Aggregating, aka "Bagging": if there are n instances, each tree is trained with n instances, but they are sampled with replacement. 2. Random Decision Forest: in addition to sampling with replacement, the tree randomly selects a subset of features to consider when making each split. This introduces a new parameter, the random candidates, which is the number of features to randomly select before making each split.
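A minimal sketch of the two sampling schemes described on the slide above, using scikit-learn rather than BigML; the dataset and parameter values are illustrative, not from the lecture.

```python
# Illustrative only: bagging vs. random decision forest in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1. Bagging: each tree sees n instances sampled *with replacement*
#    (the default base estimator here is a decision tree).
bagging = BaggingClassifier(n_estimators=20, bootstrap=True, random_state=0).fit(X, y)

# 2. Random decision forest: bootstrap sampling *plus* a random subset of
#    features (the "random candidates") considered at every split.
forest = RandomForestClassifier(n_estimators=20, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)

print(bagging.score(X, y), forest.score(X, y))
```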
- 7. Over-fitting Example: what is a round, red, 6 cm fruit?
     Diameter (cm)  Color  Shape  Fruit
     4              red    round  plum
     5              red    round  apple
     5              red    round  apple
     6              red    round  plum
     7              red    round  apple
     A single model trained on all the data answers "plum"; with bagging / a random decision forest, Sample 1 votes "plum", Sample 2 votes "apple", Sample 3 votes "apple", so the ensemble answers "apple".
- 8. Voting Methods: 1. Plurality: majority wins. 2. Confidence Weighted: majority wins, but each vote is weighted by its confidence. 3. Probability Weighted: each tree votes the distribution at its leaf node. 4. K Threshold: votes the specified class only if the required number of trees agrees, for example allowing a "True" vote if and only if at least 9 out of 10 trees vote "True". 5. Confidence Threshold: votes the specified class only if the minimum confidence is met. Also possible: linear and non-linear combinations of votes using stacking.
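As a hedged illustration of the first two voting methods (the vote values below are made up, and this is not BigML's combiner implementation):

```python
# Hypothetical per-tree votes: (predicted class, confidence) pairs.
from collections import Counter, defaultdict

votes = [("apple", 0.91), ("apple", 0.60), ("plum", 0.85)]

# 1. Plurality: the most common class label wins.
plurality = Counter(label for label, _ in votes).most_common(1)[0][0]

# 2. Confidence weighted: each vote counts as much as its confidence.
weights = defaultdict(float)
for label, confidence in votes:
    weights[label] += confidence
confidence_weighted = max(weights, key=weights.get)

print(plurality, confidence_weighted)  # both "apple" here
```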
- 9. Ensemble Demo #2
- 10. Model vs Bagging vs Random Forest: moving from a single model to bagging to a random forest gives increasing performance, increasing stochasticity, and increasing complexity, at the cost of decreasing interpretability.
- 11. Ensemble Demo #3
- 12. SMACdown: How many trees? How many nodes? Missing splits? Random candidates? Too many parameters?
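SMACdown refers to tuning these parameters automatically. As a stand-in for that idea (not SMAC itself), here is a sketch of a randomized search over the parameters the slide lists, with illustrative value ranges:

```python
# Not SMAC: a randomized search as a stand-in for automatic parameter tuning.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

param_distributions = {
    "n_estimators": [10, 50, 100, 200],      # how many trees?
    "max_depth": [4, 8, 16, None],           # how many nodes (indirectly)?
    "max_features": ["sqrt", "log2", None],  # random candidates?
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_distributions, n_iter=10, cv=3,
                            random_state=0).fit(X, y)
print(search.best_params_)
```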
- 13. Logistic Regression. Poul Petersen, CIO, BigML, Inc. Modeling probabilities.
- 14. Logistic Regression: Logistic Regression is a classification algorithm. Classification implies a discrete objective, so how can this be a regression? Why do we need another classification algorithm? …and more questions.
- 15. Linear Regression
- 16. Linear Regression
- 17. Polynomial Regression
- 18. Regression Key Take-Away: fitting a function to the data. What function can we fit to discrete data?
- 19. Discrete Data: what function?
- 20. Discrete Data: what function?
- 21. Logistic Function: as x→-∞, f(x)→0; as x→∞, f(x)→1. Looks promising, but still not "discrete".
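The slide refers to the standard logistic (sigmoid) function f(x) = 1/(1 + e^(-x)); a quick check of its limiting behavior:

```python
import math

def logistic(x):
    """f(x) = 1 / (1 + e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))

print(logistic(-10))  # ~0   (x -> -inf  =>  f(x) -> 0)
print(logistic(0))    # 0.5
print(logistic(10))   # ~1   (x -> +inf  =>  f(x) -> 1)
```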
- 22. Probabilities: read the logistic output as a probability, P≈0 on one side, P≈1 on the other, and 0<P<1 in between.
- 23. Logistic Regression: LR is a classification algorithm that models the probability of the output class. It assumes that the output is linearly related to the "predictors", but we can "fix" this with feature engineering. How do we "fit" the logistic function to real data?
- 24. Logistic Regression: fit f(x) = 1/(1 + e^(-(β₀ + β₁x))), where β₀ is the "intercept" and β₁ is the "coefficient". The inverse of the logistic function is called the "logit": logit(p) = ln(p/(1 - p)) = β₀ + β₁x, in which case solving is now a linear regression.
- 25. Logistic Regression: if we have multiple dimensions, add more coefficients: f(x) = 1/(1 + e^(-(β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ))).
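A small sketch of the model these two slides describe; the coefficient values are made up rather than fit to data:

```python
import math

def predict_proba(x, beta0, betas):
    """P(class) = logistic(beta0 + beta1*x1 + ... + betan*xn)."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

beta0, betas = -1.5, [0.8, 2.0]              # illustrative coefficients
p = predict_proba([1.0, 0.5], beta0, betas)
print(p)         # probability of the positive class (~0.57)
print(logit(p))  # recovers beta0 + beta1*x1 + beta2*x2 = 0.3
```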
- 26. Logistic Regression Demo #1
- 27. LR Parameters: 1. Bias: allows an intercept term; important if P(x=0) ≠ 0. 2. Regularization: L1 prefers zeroing individual coefficients; L2 prefers pushing all coefficients towards zero. 3. EPS: the minimum error between steps at which to stop. 4. Auto-scaling: ensures that all features contribute equally; unless there is a specific need not to auto-scale, it is recommended.
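A hedged scikit-learn sketch of these parameters (BigML's own option names differ); `tol` plays the role of EPS, and the scaler stands in for auto-scaling:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# "Auto-scaling": standardize features so they contribute comparably.
scaled = StandardScaler().fit_transform(X)

# L1 prefers zeroing individual coefficients; L2 pushes them all towards zero.
# fit_intercept is the bias term, tol the stopping tolerance (EPS).
l1 = LogisticRegression(penalty="l1", solver="liblinear",
                        fit_intercept=True, tol=1e-4).fit(scaled, y)
l2 = LogisticRegression(penalty="l2", fit_intercept=True, tol=1e-4).fit(scaled, y)

# L1 tends to produce more exactly-zero coefficients than L2.
print((l1.coef_ == 0).sum(), (l2.coef_ == 0).sum())
```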
- 28. Logistic Regression: How do we handle multiple classes? What about non-numeric inputs?
- 29. LR - Multi-Class: instead of a binary class, e.g. [true, false], we have multi-class, e.g. [red, green, blue, …]. With k classes, solve a one-vs-rest LR for each, giving coefficients βᵢ for each class.
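A brief sketch of one-vs-rest with scikit-learn; the class labels and data are synthetic stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Three classes, e.g. [red, green, blue], encoded as 0/1/2 here.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))                           # one binary LR per class (k = 3)
print([est.coef_.shape for est in ovr.estimators_])   # coefficients beta_i per class
```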
- 30. LR - Field Codings: LR expects numeric values to perform regression, so how do we handle categorical values, or text? One-hot encoding: only one feature is "hot" for each class.
      Class  color=red  color=blue  color=green  color=NULL
      red    1          0           0            0
      blue   0          1           0            0
      green  0          0           1            0
      NULL   0          0           0            1
- 31. LR - Field Codings: Dummy Encoding chooses a *reference class* and requires one less degree of freedom.
      Class  color_1  color_2  color_3
      *red*  0        0        0
      blue   1        0        0
      green  0        1        0
      NULL   0        0        1
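One-hot and dummy coding sketched with pandas (the column names are illustrative; BigML applies these codings internally):

```python
import pandas as pd

colors = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One-hot: one column per class, exactly one is "hot" per row.
one_hot = pd.get_dummies(colors, columns=["color"])

# Dummy coding: drop one column; the dropped class becomes the reference.
dummy = pd.get_dummies(colors, columns=["color"], drop_first=True)

print(one_hot)
print(dummy)
```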
- 32. LR - Field Codings: Contrast Encoding: field values must sum to zero; allows comparison between classes… so which one?
      Class  field   "influence"
      red     0.5    positive
      blue   -0.25   negative
      green  -0.25   negative
      NULL    0      excluded
- 33. LR - Field Codings: Text / Items. The "text" type gives us new features that have counts of the number of times each token occurs in the text field; "items" can be treated the same way.
      token       "hippo"  "safari"  "zebra"
      instance_1  3        0         1
      instance_2  0        11        4
      instance_3  0        0         0
      instance_4  1        0         3
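A token-count sketch in the same spirit as the table above; the documents are made up, and the exact feature-name accessor depends on the scikit-learn version:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "hippo hippo hippo zebra",
    "safari safari zebra zebra",
    "no animals mentioned here",
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)   # sparse matrix of per-token counts

print(vectorizer.get_feature_names_out())
print(counts.toarray())
```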
- 34. Logistic Regression Demo #2
- 35. Curvilinear LR: instead of fitting f(x) = 1/(1 + e^(-(β₀ + β₁x))), we could add a feature x₂ = x² and fit f(x) = 1/(1 + e^(-(β₀ + β₁x + β₂x²))). It is possible to add any higher-order terms or other functions to match the shape of the data.
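A small sketch of the curvilinear idea: engineer an x² feature so the (still linear-in-coefficients) model can follow a curved boundary; the data is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = (np.abs(x) > 1.5).astype(int)          # class depends on x**2, not on x

X_linear = x.reshape(-1, 1)                # beta0 + beta1*x
X_curved = np.column_stack([x, x ** 2])    # beta0 + beta1*x + beta2*x**2

print(LogisticRegression().fit(X_linear, y).score(X_linear, y))  # poor fit
print(LogisticRegression().fit(X_curved, y).score(X_curved, y))  # much better
```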
- 36. Logistic Regression Demo #3
- 37. LR versus DT.
      Logistic Regression: expects a "smooth" linear relationship with predictors; is concerned with the probability of a discrete outcome; lots of parameters to get wrong (regularization, scaling, codings); slightly less prone to over-fitting; because it fits a shape, might work better when less data is available.
      Decision Tree: adapts well to ragged non-linear relationships; no concern, classification, regression and multi-class are all fine; virtually parameter free; slightly more prone to over-fitting; prefers surfaces parallel to the parameter axes, but given enough data will discover any shape.
- 38. Logistic Regression Demo #4
