
No machine learning algorithm dominates in every domain, but random forests are usually hard to beat by much, and they have some advantages compared to other models: not much input preparation is needed, feature selection is implicit, they are fast to train, and the model can be visualized. While it is easy to get started with random forests, a good understanding of the model is key to getting the most out of them.

This talk will cover decision trees, from the theory to their implementation in scikit-learn. An overview of ensemble methods and bagging will follow, ending with an explanation and implementation of random forests and a look at how they compare to other state-of-the-art models.

The talk will have a very practical approach, using examples and real cases to illustrate how to use both decision trees and random forests.

We will see how the simplicity of decision trees is a key advantage compared to other methods. Unlike black-box methods, or methods that are hard to represent in multivariate cases, decision trees can easily be visualized, analyzed, and debugged until we see that our model behaves as expected. This exercise can increase our understanding of the data and the problem, while making our model perform as well as possible.

Random forests randomize and ensemble decision trees to increase their predictive power, while keeping most of their properties.
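As a minimal sketch of that idea (not code from the talk; the tiny dataset below is made up for illustration), scikit-learn's RandomForestClassifier ensembles randomized decision trees behind the same fit/predict interface as a single tree:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Made-up event-attendance data: two explanatory variables
# (age, distance) and a binary response (attended or not).
X = [[18, 100], [25, 3000], [40, 500], [55, 7000], [30, 250], [60, 9000]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Both expose the same interface; the forest averages many trees' votes.
print(tree.predict([[35, 4000]]))
print(forest.predict([[35, 4000]]))
```

Because the forest only randomizes and averages trees, everything learned about reading a single tree still applies to each of its members.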

The main topics covered will include:

* What are decision trees?

* How are decision trees trained?

* Understanding and debugging decision trees

* Ensemble methods

* Bagging

* Random Forests

* When should decision trees and random forests be used?

* Python implementation with scikit-learn

* Analysis of performance


- 1. Understanding Random Forests Marc Garcia PyData Madrid - April 9th, 2016 1 / 44 Understanding Random Forests - Marc Garcia
- 2. Introduction
- 3. Introduction About me: @datapythonista Python programmer since 2006 Master in AI from UPC Working for Bank of America http://datapythonista.github.io
- 4. Introduction Overview: Understanding Random Forests How decision trees work Training Random Forests Practical example Analyzing model parameters
- 5. How decision trees work
- 6. How decision trees work Using a decision tree Simple example: event attendance, a binary classification with 2 explanatory variables: age and distance

      from sklearn.tree import DecisionTreeClassifier
      dtree = DecisionTreeClassifier()
      dtree.fit(df[['age', 'distance']], df['attended'])
      cart_plot(dtree)
- 7. How decision trees work Data visualization
- 8. How decision trees work Decision boundary visualization
- 9. How decision trees work Decision boundary visualization
- 10. How decision trees work Decision boundary visualization
- 11. How decision trees work Decision boundary visualization
- 12. How decision trees work Decision boundary visualization
- 13. How decision trees work Decision boundary visualization
- 14. How decision trees work What does the model look like?

      def decision_tree_model(age, distance):
          if distance >= 2283.11:
              if age >= 40.00:
                  if distance >= 6868.86:
                      if distance >= 8278.82:
                          return True
                      else:
                          return False
                  else:
                      return True
              else:
                  return False
          else:
              if age >= 54.50:
                  if age >= 57.00:
                      return True
                  else:
                      return False
              else:
                  return True
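The extracted rules can be sanity-checked by running them on a few points. Below is a standalone copy of the model (same thresholds as on the slide, with the nested if/else branches condensed), not code from the talk itself:

```python
def decision_tree_model(age, distance):
    # Standalone copy of the rules extracted from the fitted tree,
    # with each True/False leaf pair condensed into one return.
    if distance >= 2283.11:
        if age >= 40.00:
            if distance >= 6868.86:
                return distance >= 8278.82
            return True
        return False
    if age >= 54.50:
        return age >= 57.00
    return True

print(decision_tree_model(30, 500))    # → True
print(decision_tree_model(55, 1000))   # → False
```

Tracing a couple of inputs through the branches like this is exactly the kind of debugging that black-box models do not allow.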
- 15. How decision trees work Training: Basic algorithm

      def train_decision_tree(x, y):
          feature, value = get_best_split(x, y)
          x_left, y_left = x[x[feature] < value], y[x[feature] < value]
          if len(y_left.unique()) > 1:
              left_node = train_decision_tree(x_left, y_left)
          else:
              left_node = None
          x_right, y_right = x[x[feature] >= value], y[x[feature] >= value]
          if len(y_right.unique()) > 1:
              right_node = train_decision_tree(x_right, y_right)
          else:
              right_node = None
          return Node(feature, value, left_node, right_node)
- 16. How decision trees work Best split

      age:      18 19 21 27 29 34 38 42 49 54 62 64
      attended:  F  F  T  F  T  T  F  T  F  T  T  T

      Candidate split 1: left = 0 True / 1 False; right = 7 True / 4 False
      Candidate split 2: left = 0 True / 2 False; right = 7 True / 3 False
- 17. How decision trees work Best split algorithm

      def get_best_split(x, y):
          best_split = None
          best_entropy = 1.
          for feature in x.columns.values:
              column = x[feature]
              for value in column.unique():
                  # count class members on each side of the candidate split
                  a = (y[column < value] == class_a_value).sum()
                  b = (y[column < value] == class_b_value).sum()
                  left_weight = (a + b) / len(y.index)
                  left_entropy = entropy(a, b)
                  a = (y[column >= value] == class_a_value).sum()
                  b = (y[column >= value] == class_b_value).sum()
                  right_weight = (a + b) / len(y.index)
                  right_entropy = entropy(a, b)
                  split_entropy = left_weight * left_entropy + right_weight * right_entropy
                  if split_entropy < best_entropy:
                      best_split = (feature, value)
                      best_entropy = split_entropy
          return best_split
- 18. How decision trees work Entropy For a given subset¹:

      entropy = -Pr(attending) * log2(Pr(attending)) - Pr(¬attending) * log2(Pr(¬attending))   (1)

  ¹ Note that for pure splits it is assumed that 0 * log2(0) = 0
- 19. How decision trees work Entropy algorithm

      import math

      def entropy(a, b):
          # a and b are the class counts; 0 * log2(0) is taken as 0,
          # so pure subsets have zero entropy
          total = float(a + b)
          prob_a = a / total
          prob_b = b / total
          result = 0.0
          if prob_a > 0:
              result -= prob_a * math.log(prob_a, 2)
          if prob_b > 0:
              result -= prob_b * math.log(prob_b, 2)
          return result
- 20. How decision trees work Information gain For a given split:

      information_gain = entropy_parent - (items_left / items_total) * entropy_left - (items_right / items_total) * entropy_right   (2)
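As a worked example (not in the slides), equations (1) and (2) can be applied to candidate split 1 from the best-split slide (left: 0 True / 1 False; right: 7 True / 4 False, out of 7 True / 5 False in total):

```python
import math

def entropy(a, b):
    # Entropy of a subset with class counts a and b; 0*log2(0) taken as 0.
    total = float(a + b)
    result = 0.0
    for count in (a, b):
        p = count / total
        if p > 0:
            result -= p * math.log(p, 2)
    return result

# Parent node: 7 attended (True), 5 did not (False).
parent = entropy(7, 5)
# Candidate split 1: left = 0 True / 1 False, right = 7 True / 4 False.
left, right = entropy(0, 1), entropy(7, 4)
gain = parent - (1 / 12.0) * left - (11 / 12.0) * right
print(round(parent, 3), round(gain, 3))  # → 0.98 0.113
```

The pure left subset contributes zero entropy, so all of the remaining impurity comes from the right side; the split with the highest information gain (lowest weighted entropy) is the one the training algorithm picks.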
- 21. Practical example
- 22. Practical example Etsy dataset Features: Feedback (number of reviews) Age (days since shop creation) Admirers (likes) Items (number of products) Response variable: Sales (number) Samples: 58,092 Source: www.bigml.com
- 23. Practical example Features distribution
- 24. Practical example Sales visualization
- 25. Practical example Feature correlation

                 sales     admirers  age       feedback  items
      sales      1.000000  0.458261  0.181045  0.955949  0.502423
      admirers   0.458261  1.000000  0.340939  0.401995  0.268985
      age        0.181045  0.340939  1.000000  0.184238  0.167535
      feedback   0.955949  0.401995  0.184238  1.000000  0.458955
      items      0.502423  0.268985  0.167535  0.458955  1.000000
- 26. Practical example Correlation to sales
- 27. Practical example Decision tree visualization
- 28. Practical example Decision tree visualization (top)
- 29. Practical example Decision tree visualization (left)
- 30. Practical example Decision tree visualization (right)
- 31. Practical example What is wrong with the tree? Only one variable is used to predict in some cases. Lack of stability: if feedback = 207 we predict 109 sales, but if feedback = 208 we predict 977 sales, and our model can change dramatically if we add a few more samples to the dataset. Overfitting: we make predictions based on a single sample.
- 32. Training Random Forests
- 33. Training Random Forests Ensembles: The board of directors Why do companies have a board of directors? What is best: the single best point of view, or a mix of good points of view? How good was our best tree? What about a mix of not-so-good trees?
- 34. Training Random Forests Ensemble of trees How do we build optimized trees that are different? We need to randomize them: samples (bagging), features, and splits
- 35. Training Random Forests Bagging: Uniform sampling with replacement
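A minimal sketch of that sampling step (not the talk's code): each tree in the ensemble is trained on a bootstrap sample of the same size as the original dataset, drawn uniformly with replacement.

```python
import random

def bootstrap_sample(samples, seed=None):
    # Draw len(samples) items uniformly *with replacement*: some rows
    # appear several times, others are left out (about 37% on average).
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in samples]

data = list(range(10))
sample = bootstrap_sample(data, seed=0)
print(len(sample))  # same size as the original dataset
print(sorted(set(sample)))
```

The rows left out of a tree's bootstrap sample are what makes per-tree held-out evaluation possible.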
- 36. Training Random Forests Randomizing features Even with bagging, trees will usually make the top splits on the same feature. The max_features parameter controls this: for regression, use all features; for classification, use the square root of the number of features
- 37. Training Random Forests Randomizing splits In the original decision tree, a split per sample is considered. We can consider a subset instead: we obtain randomized trees and performance is improved. sklearn implements ExtraTreeRegressor and ExtraTreeClassifier
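The split randomization can be sketched like this (an illustration of the extremely-randomized-trees idea, not the talk's or sklearn's code): instead of evaluating every observed value as a cut-point, draw one random threshold per feature and keep only those candidates.

```python
import random

def random_split_candidates(columns, rng):
    # One random cut-point per feature, drawn uniformly between that
    # feature's min and max, instead of testing every observed value.
    return {name: rng.uniform(min(values), max(values))
            for name, values in columns.items()}

rng = random.Random(0)
columns = {'age': [18, 19, 21, 27, 29, 34, 38, 42, 49, 54, 62, 64],
           'distance': [100, 250, 500, 3000, 7000, 9000]}
candidates = random_split_candidates(columns, rng)
for name, threshold in candidates.items():
    print(name, min(columns[name]) <= threshold <= max(columns[name]))
```

Evaluating only these few random candidates is why extra-trees train faster than exhaustive-split CART while still decorrelating the ensemble.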
- 38. Training Random Forests What happened to our example problems? Only one variable used to predict in some cases: we can avoid it by not using all features. Lack of stability: our model is smoother. Overfitting (predictions based on a single sample): mostly avoided, and random forests come with built-in cross validation
- 39. Training Random Forests Smoothing
- 40. Training Random Forests Parameter optimization (Etsy)
- 41. Training Random Forests Parameter optimization (Bank marketing) Data source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
- 42. Training Random Forests When should we use Random Forests? Using a CART is usually good as part of exploratory analysis. Do not use CART or RF when the problem is linear. They are especially good for ordinal (categorical but sorted) variables. Performance: CART is slower than linear models (e.g. logit); RF trains N trees and takes N times longer, but it works across multiple cores; still, RF is usually much faster than ANN or SVM, and results can be similar
- 43. Training Random Forests Summary CART is a simple but still powerful model, and by visualizing trees we can better understand our data. Ensembles usually improve the predictive power of models. Random Forests fix CART problems: better use of features, more stability, and better generalization (RFs avoid overfitting). Random Forests main parameters: min_samples_leaf (inherited from CART) and n_estimators
- 44. Training Random Forests Thank you QUESTIONS?
