Decision Tree Problems
• Overfitting the data
• High variance
• Not globally optimal
Random Forests
• One decision tree → many decision trees (an ensemble)
Building RF
• Sample from the data (with replacement)
• At each split, sample from the available variables
• Repeat for each tree
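The three steps above can be sketched in a few lines of Python. This is a toy illustration of the sampling scheme only, not the randomForest implementation; all function and variable names here are my own:

```python
import random

def bootstrap_sample(rows, rng):
    # Step 1: draw n rows from the data, with replacement
    return [rng.choice(rows) for _ in range(len(rows))]

def candidate_variables(all_vars, mtry, rng):
    # Step 2: at each split, consider only a random subset of variables
    return rng.sample(all_vars, mtry)

def build_forest(rows, all_vars, n_trees, mtry, seed=0):
    # Step 3: repeat for each tree; here a "tree" is reduced to its
    # bootstrap sample plus the variables offered at its first split
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)
        split_vars = candidate_variables(all_vars, mtry, rng)
        forest.append((sample, split_vars))
    return forest

rows = list(range(100))                            # stand-in for data rows
variables = ["age", "job", "balance", "duration"]  # hypothetical predictors
forest = build_forest(rows, variables, n_trees=10, mtry=2)
```

Because each tree sees a different bootstrap sample and a different variable subset at each split, the trees end up making different (and less correlated) errors.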
Why more than one tree?
• Creates decorrelated trees
• Reduces the variance of the predictor
• Built-in cross-validation (each tree is validated on the data left out of its bootstrap sample)
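The variance-reduction point can be demonstrated directly: averaging B uncorrelated estimators divides the variance by roughly B. A small Monte Carlo sketch (my own illustration, with Gaussian noise standing in for individual tree predictions):

```python
import random
import statistics

def noisy_estimate(rng):
    # one high-variance "tree" prediction around a true value of 0
    return rng.gauss(0, 1)

def ensemble_estimate(rng, n_trees):
    # average many independent (uncorrelated) predictions
    return sum(noisy_estimate(rng) for _ in range(n_trees)) / n_trees

rng = random.Random(42)
single = [noisy_estimate(rng) for _ in range(5000)]
avg25 = [ensemble_estimate(rng, 25) for _ in range(5000)]

var_single = statistics.variance(single)  # about 1
var_avg = statistics.variance(avg25)      # about 1/25
```

In a real forest the trees are only approximately uncorrelated, so the reduction is smaller than 1/B, which is exactly why the per-split variable sampling (decorrelation) matters.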
Random Forests

rffit.1 <- randomForest(takes.loan ~ ., data=bank)

Most important parameters:
• ntree: number of trees to grow (default: 500)
• mtry: number of variables randomly selected at each node (default: square root of the number of predictors for classification; number of predictors / 3 for regression)
How'd it do?
• Guessing precision: 11.7%
• Random Forest precision: 64.5%

Confusion matrix (rows: actual, columns: predicted):

             Predicted no   Predicted yes
Actual no    (1) 38,526     (3) 1,396
Actual yes   (2) 2,748      (4) 2,541
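Both percentages follow directly from the confusion matrix. The arithmetic below is my own reconstruction of the slide's numbers:

```python
# Confusion matrix from the slide: rows = actual, columns = predicted
tn, fp = 38526, 1396  # actual no:  predicted no, predicted yes
fn, tp = 2748, 2541   # actual yes: predicted no, predicted yes

# Precision for "yes": of everything predicted yes, how much really was yes
precision = tp / (tp + fp)                   # 2541 / 3937  ~ 64.5%

# "Guessing" precision = base rate of yes in the data
base_rate = (tp + fn) / (tn + fp + fn + tp)  # 5289 / 45211 ~ 11.7%
```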
Benefits of RF
• Don't need a lot of tuning
• Don't need an extra cross-validation step
• Many implementations: R, Weka, RapidMiner, Mahout
References
• Breiman, Leo. Classification and Regression Trees. Belmont, Calif: Wadsworth International Group, 1984. Print.
• Breiman, Leo and Adele Cutler. Random Forests. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_contact.htm
• Moro, S., R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM2011, pp. 117-121, Guimarães, Portugal, October 2011. EUROSIS.