Cutting Edge Predictive Modeling For Classification
1. Cutting Edge Predictive Modeling for Classification
- A look at GBM in R
Pankaj Sharma
Oct 25, 2012
2. The views expressed herein are my own and do not
necessarily represent the views of my past / current
employer or people I have worked with.
My apologies if some sources have not been
cited. I make my best attempt to include the
relevant sources known to me.
3. Performance vs. Complexity
[Chart: predictive performance (y-axis) against model type, from simple to complex (x-axis)]
Highly complex models have won numerous competitions, for example the Netflix Prize
5. Leo Breiman's Philosophy of Data Analysis
According to Leo Breiman (2001) there are two cultures in the use of statistical
modeling to reach conclusions from data.
- One assumes that the data are generated by a given stochastic data model.
- The other uses algorithmic models and treats the data mechanism as unknown.
• The statistical community has been committed to the almost exclusive use of data models.
“How much are we prepared to let the data tell us about the process we are studying?
For Breiman the answer would typically be ‘everything’, whereas for more traditional
statisticians the answer would be far more qualified” – Dan Steinberg
http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?handle=euclid.ss/1009213726&view=body&content-type=pdf_1
6. Best Off-the-Shelf Predictive Model for Classification
Model should be able to handle it all:
- Missing values
- Variable scaling
- Correlated variables
- Variable transformations
- Categorical variables
In 90’s - Decision Tree
Today - Stochastic Gradient Boosting
https://www.salford-systems.com/en/products/treenet
7. Modeling in R – Some Available Functions
Taken from Revolution R Presentation on Introduction to Data Mining
8. Advanced Analytics begins in R
- Cutting-edge data mining algorithms are usually implemented in R first.
- R is not difficult to learn. In fact, R is easier and more intuitive than SAS.
Friedman, J. H. "Tutorial: Getting Started with MART in R." (April 2002)
MART with R - MART(tm) is an implementation of the gradient tree boosting methods for predictive data mining
(regression and classification) described in Greedy Function Approximation: a Gradient Boosting Machine and
Stochastic Gradient Boosting.
http://www-stat.stanford.edu/~jhf/R-MART.html
9. KDD Cup 2009 Winner Used R
Entrants were given 50k records with 15k variables of CRM data from a telecommunication
company and tasked with predicting three target variables:
1. Churn (likelihood of a customer switching to another provider)
2. Appetency (propensity to buy)
3. Likelihood of response to up-selling
Fast Track
- 1st prize: IBM Research; entry "Ensemble Selection for the KDD Cup Orange Challenge"
- 2nd prize: ID Analytics, Inc; entry "KDD Cup Fast Scoring on a Large Database - TreeNet"
- 3rd prize: Old dogs with new tricks (David Slate, Peter W. Frey); entry: None
Slow Track
- 1st prize: University of Melbourne; entry "Boosting"
- 2nd prize: Financial Engineering Group, Inc. Japan; entry "Stochastic Gradient Boosting"
- 3rd prize: National Taiwan University, Computer Science and Information Engineering; entry "Fast Scoring on a Large Database using regularized maximum entropy model, categorical/numerical balanced AdaBoost and selective Naive Bayes"
Fast challenge: complete in five days. Slow challenge: one-month deadline from dataset availability.
The 1st, 2nd and 3rd prize winners used some form of Gradient Boosting
10. How to win the KDD Cup Challenge with R and GBM – Hugh Miller
Feature selection is an important first step for all successful data mining projects.
- For categorical variables we just took the average number of 1's in the response for each
category and used this as a predictor.
- For continuous variables we split the variable up into "bins", as you would via histogram, and
again took the average number of 1's in the response for each bin as the predictor.
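The two encodings above can be sketched in base R as follows. This is only an illustration: the data frame `d`, its columns `resp`, `plan` and `usage`, the helper name `rate_encode`, and the choice of 4 equal-width bins are all assumptions, not details from the winning entry.

```r
# Toy data (hypothetical): binary response, one categorical and one continuous predictor
d <- data.frame(
  resp  = c(1, 0, 1, 0, 0, 1, 0, 0),
  plan  = c("a", "a", "b", "b", "b", "c", "c", "c"),
  usage = c(1, 2, 3, 4, 5, 6, 7, 8),
  stringsAsFactors = FALSE
)

# Replace each value of `group` with the mean of the 0/1 response in its group
rate_encode <- function(group, resp) {
  rates <- tapply(resp, group, mean)
  as.numeric(rates[as.character(group)])
}

# Categorical variable: average number of 1's in the response per category
d$plan_rate <- rate_encode(d$plan, d$resp)

# Continuous variable: cut into equal-width bins (as a histogram would),
# then use the average number of 1's per bin as the predictor
d$usage_bin  <- cut(d$usage, breaks = 4, include.lowest = TRUE)
d$usage_rate <- rate_encode(d$usage_bin, d$resp)
```

In practice the rates would probably be computed on training data only and then mapped onto validation data, so that the response does not leak into the predictor.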
The main model was a gradient boosted machine which used the "gbm" package in R.
This basically fits a series of small decision trees, up-weighting the observations that
are predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the
"1" response class. A fair amount of time was spent optimizing the number of trees,
how big they should be etc, but a fit of 5,000 trees only took a bit over an hour to fit.
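A fit of this kind might look roughly like the sketch below, assuming the `gbm` package is installed. The simulated data, the weight of 3 on the "1" class, and every tuning value are illustrative assumptions; they are not the settings used in the winning entry.

```r
library(gbm)

set.seed(1)
n <- 1000
# Simulated data (hypothetical): a minority of responses are 1
train_df <- data.frame(x1 = rnorm(n), x2 = runif(n))
train_df$resp <- rbinom(n, 1, plogis(-2 + 1.5 * train_df$x1))

# Up-weight the "1" response class; the factor 3 is an arbitrary illustration
w <- ifelse(train_df$resp == 1, 3, 1)

fit <- gbm(
  resp ~ x1 + x2,
  data = train_df,
  distribution = "bernoulli",  # Bernoulli loss for a 0/1 response
  weights = w,
  n.trees = 200,               # number of small trees fitted in sequence
  interaction.depth = 2,       # controls how big each tree is
  shrinkage = 0.05,
  bag.fraction = 0.5,
  n.minobsinnode = 10,
  verbose = FALSE
)

# Predicted probabilities of resp == 1 using all 200 trees
p <- predict(fit, newdata = train_df, n.trees = 200, type = "response")
```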
We used trees to avoid doing much data cleaning – they automatically allow for
extreme results, non-linearity, missing values and handle both categorical and
continuous variables. The main adjustment we had to make was to aggregate the
smaller categories in the categorical variables, as they tended to distort the fits.
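Aggregating the smaller categories could be done along the lines of the base R sketch below. The helper name `lump_rare`, the label "other", and the threshold of 3 observations are hypothetical choices for illustration; the original entry does not say what cut-off was used.

```r
# Collapse levels with fewer than `min_n` observations into one "other" level
lump_rare <- function(x, min_n, other = "other") {
  x <- as.character(x)
  counts <- table(x)                       # observations per category (NAs ignored)
  rare <- names(counts)[counts < min_n]    # categories below the threshold
  x[x %in% rare] <- other
  factor(x)
}

region <- c("n", "n", "n", "s", "s", "s", "e", "w")
lumped <- lump_rare(region, min_n = 3)
table(lumped)
```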
http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html
11. R / SAS Expertise
The Tipping Point: How Little Things Make a Big Difference (2000)
Blink: The Power of Thinking Without Thinking (2005)
Outliers: The Story of Success (2008)
Same logic applies in SAS – macros in SAS are similar to functions in R
Taken from Revolution R Presentation on Introduction to Data Mining
12. Case Study – Current Approach vs. GBM
GBM shows promise of higher lift and response rate in the top decile of validation data
(historical direct marketing program) when compared to the current approach
[Chart: lift and KS by decile (1 to 10) on validation data, current approach vs. GBM. Lift and KS numbers are masked in this chart.]
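For readers who want to reproduce a lift and KS by decile comparison on their own data, here is a minimal base R sketch. It follows one common convention (decile 1 holds the highest scores, lift is the decile response rate divided by the overall response rate, KS is the gap between the cumulative shares of responders and non-responders); how the masked numbers in the case study were computed is not stated, so treat the function `decile_table` and the simulated data as assumptions.

```r
# Decile table of lift and KS from model scores and 0/1 outcomes
decile_table <- function(score, resp) {
  ord  <- order(score, decreasing = TRUE)
  resp <- resp[ord]
  n    <- length(resp)
  decile <- ceiling(10 * seq_len(n) / n)   # 1 = highest-scored tenth

  events     <- as.numeric(tapply(resp, decile, sum))
  non_events <- as.numeric(tapply(1 - resp, decile, sum))
  size       <- events + non_events

  data.frame(
    decile    = 1:10,
    resp_rate = events / size,
    lift      = (events / size) / mean(resp),
    ks        = cumsum(events) / sum(events) -
                cumsum(non_events) / sum(non_events)
  )
}

# Simulated scores (hypothetical): higher score means more likely to respond
set.seed(42)
score <- runif(2000)
resp  <- rbinom(2000, 1, score)
tab   <- decile_table(score, resp)
```

Running the same function on the scores from two models gives two tables whose `lift` and `ks` columns can be plotted against `decile` for a side-by-side comparison.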
# One way of building the gbm model in R.
# `t` is assumed to be the training data frame, with `resp` as its 0/1 response column.
library(gbm)
mdl <- gbm(resp ~ ., data = t, distribution = "bernoulli",
           n.trees = 1000, interaction.depth = 1, n.minobsinnode = 20,
           shrinkage = 0.1, bag.fraction = 0.5, train.fraction = 1,
           keep.data = TRUE, verbose = TRUE, cv.folds = 3)
http://cran.r-project.org/web/packages/gbm/index.html