Machine Learning Workshop

Presentation on Decision Trees and Random Forests given for the Boston Predictive Analytics Machine Learning Workshop on December 2, 2012. Code to accompany the slides is available at www.github.com/dgerlanc/mlclass or http://www.enplusadvisors.com/wp-content/uploads/2012/12/mlclass_1.0.tar.gz

Speaker note: John, Dave, and I have spoken a bit about the motivations for using machine learning techniques.

1. Hands-on Classification: Decision Trees and Random Forests
   Predictive Analytics Meetup Group, Machine Learning Workshop
   December 2, 2012
   Daniel Gerlanc, Managing Director
   Enplus Advisors, Inc.
   www.enplusadvisors.com | dgerlanc@enplusadvisors.com
2. © Daniel Gerlanc, 2012. All rights reserved.
   If you'd like to use this material for any purpose, please contact dgerlanc@enplusadvisors.com
3. What You'll Learn
   • Intuition behind decision trees and random forests
   • Implementation in R
   • Assessing the results
4. Dataset
   • Chemical Analysis of Italian Wines
   • http://www.parvus.unige.it/
   • 178 records, 14 attributes
5. Follow along
     > library(mlclass)
     > data(wine)
     > str(wine)
     'data.frame': 178 obs. of 14 variables:
      $ Type       : Factor w/ 2 levels "Grig","No": 2 2 2 2 2 2 2 2 2 2 ...
      $ Alcohol    : num 14.2 13.2 13.2 14.4 13.2 ...
      $ Malic      : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
      $ Ash        : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
      $ Alcalinity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
6. What are Decision Trees?
   • A model for partitioning an input space
7. What's partitioning?
   See rf-1.R
8. Create the 1st split
   [scatter plot: the first split divides the space into a "Not G" region and a "G" region]
   See rf-1.R
9. Create the 2nd split
   [scatter plot: the second split carves out an additional "G" region]
   See rf-1.R
10. Create more splits…
   [scatter plot: further splits, regions labeled "Not G" and "G"]
   I drew this one in.
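   For intuition, here is a minimal sketch of what a single split does, using the wine data from slide 5. The variable (Alcohol) and the cutoff of 13 are illustrative placeholders, not the actual split computed in rf-1.R:

     library(mlclass)                      # workshop package providing the wine data
     data(wine)
     split <- wine$Alcohol >= 13           # one candidate split of the input space
     table(Region = ifelse(split, "Alcohol >= 13", "Alcohol < 13"),
           Type   = wine$Type)             # how each region separates Grig from No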
11. Another view of partitioning
    See rf-2.R
12. Use R to do the partitioning
      tree.1 <- rpart(Type ~ ., data=wine)
      prp(tree.1, type=4, extra=2)
    • See the 'rpart' and 'rpart.plot' R packages.
    • Many parameters available to control the fit.
    See rf-2.R
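    The slide notes that many parameters control the fit. A hedged sketch of what that might look like with rpart.control; the specific values are illustrative, not the settings used in rf-2.R:

      library(rpart)
      library(rpart.plot)
      ctrl <- rpart.control(cp = 0.01,      # complexity parameter; smaller values grow larger trees
                            minsplit = 20,  # minimum records in a node before a split is attempted
                            maxdepth = 10)  # cap on tree depth
      tree.2 <- rpart(Type ~ ., data = wine, control = ctrl)
      prp(tree.2, type = 4, extra = 2)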
13. Make predictions on a test dataset
      predict(tree.1, newdata=wine, type="vector")
14. How'd it do?
    • Guessing (majority class): 60.11% accuracy
    • CART: 94.38% accuracy
    • Precision: 92.95% (66 / 71)
    • Sensitivity/Recall: 92.95% (66 / 71)

                      Actual
    Predicted      Grig     No
    Grig       (1)   66  (3)  5
    No         (2)    5  (4) 102
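    A sketch of how the numbers on this slide could be reproduced in R. It predicts on the training data, as the slide does, and uses type="class" to get factor labels rather than the numeric codes returned by type="vector":

      pred <- predict(tree.1, newdata = wine, type = "class")
      conf <- table(Predicted = pred, Actual = wine$Type)
      conf
      sum(diag(conf)) / sum(conf)                  # overall accuracy
      conf["Grig", "Grig"] / sum(conf["Grig", ])   # precision for the Grig class
      conf["Grig", "Grig"] / sum(conf[, "Grig"])   # sensitivity/recall for the Grig class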
15. Decision Tree Problems
    • Overfitting the data
    • May not use all relevant features
    • Perpendicular decision boundaries
16. Random Forests
    One Decision Tree → Many Decision Trees (Ensemble)
17. Random Forest Fixes
    Random forests address each of the decision tree problems above:
    • Overfitting the data
    • May not use all relevant features
    • Perpendicular decision boundaries
18. Building a Random Forest
    For each tree:
    • Sample from the data
    • At each split, sample from the available variables
19. Bootstrap Sampling
20. Sample Attributes at Each Split
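    To make the two sampling steps concrete, here is a toy sketch of what happens for a single tree. randomForest() does all of this internally, so this is for intuition only:

      set.seed(1)
      n <- nrow(wine)
      boot.idx   <- sample(n, size = n, replace = TRUE)   # bootstrap sample of rows for one tree
      boot.data  <- wine[boot.idx, ]
      predictors <- setdiff(names(wine), "Type")
      mtry       <- floor(sqrt(length(predictors)))       # default mtry for classification
      split.vars <- sample(predictors, mtry)              # variables considered at one split
      split.vars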
21. Motivations for RF
    • Create uncorrelated trees
    • Variance reduction
    • Subspace exploration
22. Random Forests
      rffit.1 <- randomForest(Type ~ ., data=wine)
    See rf-3.R
23. RF Parameters in R
    The most important parameters are:

    Variable    Description                                           Default
    ntree       Number of trees                                       500
    mtry        Number of variables randomly selected at each node    sqrt(# predictors) for classification;
                                                                      (# predictors) / 3 for regression
    nodesize    Minimum number of records in a terminal node          1 for classification; 5 for regression
    sampsize    Number of records in each bootstrap sample            63.2%
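    A hedged example of overriding these parameters in a call to randomForest(); the specific values are illustrative, not recommendations from the workshop:

      library(randomForest)
      rffit.2 <- randomForest(Type ~ ., data = wine,
                              ntree    = 1000,  # grow more trees than the default 500
                              mtry     = 4,     # variables tried at each split
                              nodesize = 5,     # minimum records in a terminal node
                              sampsize = 100)   # records drawn for each bootstrap sample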
24. How'd it do?
    • Guessing accuracy: 60.11%
    • Random Forest: 98.31% accuracy
    • Precision: 95.77% (68 / 71)
    • Sensitivity/Recall: 100% (68 / 68)

                      Actual
    Predicted      Grig     No
    Grig       (1)   68  (3)  3
    No         (2)    0  (4) 107
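    As a side note, randomForest() also reports an out-of-bag (OOB) error estimate and confusion matrix, which is usually a bit more pessimistic than training-set numbers like those above:

      print(rffit.1)       # OOB error estimate and confusion matrix
      rffit.1$confusion    # confusion matrix with per-class error rates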
25. Tuning RF: Grid Search
    This is the default. See rf-4.R
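    A minimal sketch of a grid search over mtry using the OOB error as the tuning criterion; rf-4.R may implement this differently (for example with the caret package), so treat this as one possible approach:

      mtry.grid <- 1:8
      oob.err <- sapply(mtry.grid, function(m) {
        fit <- randomForest(Type ~ ., data = wine, mtry = m, ntree = 500)
        fit$err.rate[nrow(fit$err.rate), "OOB"]   # OOB error after the last tree
      })
      data.frame(mtry = mtry.grid, oob.error = oob.err)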
26. Tuning is Expensive
27. Benefits of RF
    • Good performance with default settings
    • Relatively easy to make parallel
    • Many implementations: R, Weka, RapidMiner, Mahout
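    One common way to exploit the "easy to make parallel" point is to grow several smaller forests on different workers and merge them with randomForest::combine(). The foreach/doParallel setup below is an assumption, not part of the workshop code:

      library(randomForest)
      library(foreach)
      library(doParallel)
      registerDoParallel(cores = 4)               # assumed worker count
      rf.par <- foreach(nt = rep(125, 4), .combine = randomForest::combine,
                        .packages = "randomForest") %dopar% {
        randomForest(Type ~ ., data = wine, ntree = nt)  # 4 x 125 = 500 trees total
      }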
28. References
    • Liaw, A. and M. Wiener (2002). Classification and Regression by randomForest. R News 2(3), 18-22.
    • Breiman, Leo, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Belmont, Calif.: Wadsworth International Group, 1984. Print.
    • Breiman, Leo and Adele Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm
