Cutting Edge Predictive Modeling for Classification
- A look at GBM in R




Pankaj Sharma
Oct 25, 2012
The views expressed herein are my own and do not necessarily represent the views of my
past / current employer or people I have worked with.

My apologies if some sources have not been cited. I have made my best attempt to include
the relevant sources known to me.




Performance vs. Complexity




[Chart: Predictive Performance (y-axis) vs. Model Type, from Simple to Complex (x-axis)]

Highly complex models have won numerous competitions – e.g., the Netflix Prize
Brief Mention of Modeling Techniques


    Statistical Models
      - Linear regression
      - Logistic regression
      - Naïve Bayes

    Semi-Parametric Models
      - Credit Scoring
      - GAM: generalized additive model
      - GNBC: generalized naïve Bayes classifier

    Algorithmic / Non-Parametric Models
      -   MARS: multivariate adaptive regression splines
      -   Gradient Boosting / TreeNet
      -   SVM: support vector machines
      -   Random Forests
      -   kNN: k-nearest neighbors




Leo Breiman's Philosophy of Data Analysis


     According to Leo Breiman (2001), there are two cultures in the use of statistical
      modeling to reach conclusions from data:
        - One assumes that the data are generated by a given stochastic data model.
        - The other uses algorithmic models and treats the data mechanism as unknown.
           • The statistical community has been committed to the almost exclusive use of data models.




      “How much are we prepared to let the data tell us about the process we are studying?
      For Breiman the answer would typically be ‘everything’, whereas for more traditional
      statisticians the answer would be far more qualified” – Dan Steinberg




   http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?handle=euclid.ss/1009213726&view=body&content-type=pdf_1

Best Off-the-Shelf Predictive Model for Classification




          Missing values
          Variable scaling
          Correlated variables
          Variable transformations
          Categorical variables

         An off-the-shelf model should be able to handle all of the above


                            In the ’90s – Decision Trees

                Today – Stochastic Gradient Boosting
                              https://www.salford-systems.com/en/products/treenet
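
To make the claim concrete, here is a minimal R sketch on simulated data (all names and
values are hypothetical): the gbm package accepts factor predictors and missing values
directly, with no scaling, dummy coding or imputation step.

# Hypothetical demonstration that gbm tolerates "messy" inputs as-is
library(gbm)
set.seed(1)
n  <- 1000
df <- data.frame(
  y  = rbinom(n, 1, 0.3),                     # binary 0/1 response
  x1 = rnorm(n, mean = 50, sd = 10),          # unscaled numeric predictor
  x2 = factor(sample(letters[1:4], n, TRUE))  # categorical predictor, no dummy coding
)
df$x1[sample(n, 100)] <- NA                   # inject missing values

fit <- gbm(y ~ x1 + x2, data = df, distribution = "bernoulli",
           n.trees = 500, interaction.depth = 2, shrinkage = 0.05)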




Modeling in R – Some Available Functions




[Image: table of modeling functions available in R – taken from the Revolution R presentation “Introduction to Data Mining”]

Advanced Analytics begins in R




                  Cutting-edge data mining algorithms are usually implemented in R first.

                  R is not difficult to learn; in fact, it is easier and more intuitive than SAS.




              Friedman, J. H. "Tutorial: Getting Started with MART in R." (April 2002)


      MART with R - MART(tm) is an implementation of the gradient tree boosting methods for predictive data mining
      (regression and classification) described in Greedy Function Approximation: a Gradient Boosting Machine and
                                               Stochastic Gradient Boosting.
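
As a compact summary of the method described in those two papers (standard notation; a
reference sketch, not Friedman's exact pseudo-code): at each stage m, a small regression
tree is fit to the pseudo-residuals of the loss and added with a shrunken step,

\begin{align*}
\tilde{y}_{im} &= -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}} && \text{(pseudo-residuals)} \\
h_m &= \arg\min_{h}\sum_{i=1}^{N}\bigl(\tilde{y}_{im} - h(x_i)\bigr)^2 && \text{(fit a small regression tree)} \\
\rho_m &= \arg\min_{\rho}\sum_{i=1}^{N} L\bigl(y_i,\, F_{m-1}(x_i) + \rho\, h_m(x_i)\bigr) && \text{(line search)} \\
F_m(x) &= F_{m-1}(x) + \nu\,\rho_m\, h_m(x) && \text{(update with shrinkage } \nu\text{)}
\end{align*}

The stochastic variant fits each tree on a random subsample of the training data (the
bag.fraction argument in the gbm package).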



                                     http://www-stat.stanford.edu/~jhf/R-MART.html

KDD Cup 2009 winner used R



       Entrants were given 50k records with 15k variables of CRM data from a telecommunications
       company and were tasked with predicting three target variables:
       1. Churn (likelihood of a customer switching to another provider)
       2. Appetency (propensity to buy)
       3. Likelihood of response to up-selling

     1st prize
        Fast Track – IBM Research: “Ensemble Selection for the KDD Cup Orange Challenge”
        Slow Track – University of Melbourne: Boosting

     2nd prize
        Fast Track – ID Analytics, Inc.: “KDD Cup Fast Scoring on a Large Database” (TreeNet)
        Slow Track – Financial Engineering Group, Inc., Japan: Stochastic Gradient Boosting

     3rd prize
        Fast Track – Old dogs with new tricks (David Slate, Peter W. Frey): no method title listed
        Slow Track – National Taiwan University, Computer Science and Information Engineering:
                     “Fast Scoring on a Large Database using regularized maximum entropy model,
                     categorical/numerical balanced AdaBoost and selective Naive Bayes”

     Fast challenge: five days to complete; slow challenge: one-month deadline – both counted from dataset availability.



           The 1st, 2nd and 3rd prize winners all used some form of gradient boosting




How to win the KDD Cup Challenge with R and GBM – Hugh Miller


     Feature selection is an important first step for all successful data mining projects.
       - For categorical variables we took the average of the 0/1 response (the response rate) for each
         category and used this as a predictor.
       - For continuous variables we split the variable into "bins", as you would for a histogram, and
         again took the response rate within each bin as the predictor (see the R sketch after this list).

     The main model was a gradient boosted machine built with the "gbm" package in R.
      This fits a series of small decision trees, up-weighting the observations that are
      predicted poorly at each iteration. We used Bernoulli loss and also up-weighted the
      "1" response class. A fair amount of time was spent optimizing the number of trees,
      how big they should be, etc., but fitting 5,000 trees took only a bit over an hour.

     We used trees to avoid doing much data cleaning – they automatically accommodate
      extreme values, non-linearity and missing values, and handle both categorical and
      continuous variables. The main adjustment we had to make was to aggregate the
      smaller categories in the categorical variables, as they tended to distort the fits.
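
The encoding trick described above takes only a few lines of R. This is a hypothetical
sketch of the idea – the function names and data frame are illustrative assumptions, not
the winners' actual code:

# Categorical variable: replace each level with its training-set response rate
encode_categorical <- function(x, y) {
  rate <- tapply(y, x, mean)          # mean of the 0/1 response per level
  as.numeric(rate[as.character(x)])   # map each row to its level's rate
}

# Continuous variable: bin it histogram-style, then use the per-bin response rate
encode_continuous <- function(x, y, bins = 20) {
  b    <- cut(x, breaks = bins)
  rate <- tapply(y, b, mean)
  as.numeric(rate[as.character(b)])
}

# Hypothetical usage, with `train` a data frame holding a 0/1 target `resp`:
# train$cat_enc <- encode_categorical(train$cat_var, train$resp)
# train$num_enc <- encode_continuous(train$num_var, train$resp)
# Up-weighting the "1" class, as described above, can be done via gbm's
# `weights` argument, e.g. weights = ifelse(train$resp == 1, 5, 1)  # ratio is illustrative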




             http://www.cybaea.net/Blogs/Data/How-to-win-the-KDD-Cup-Challenge-with-R-and-gbm.html

R / SAS Expertise




[Image: covers of Malcolm Gladwell’s books – The Tipping Point: How Little Things Make a Big Difference (2000); Blink: The Power of Thinking Without Thinking (2005); Outliers: The Story of Success (2008)]




           Same logic applies in SAS – macros in SAS are similar to functions in R
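
A small, hypothetical illustration of that analogy – a reusable decile-ranking step written
as a SAS macro (shown in comments) and as an ordinary R function:

# In SAS, the reusable step might be a macro:
#   %macro decile(data=, var=);
#     proc rank data=&data out=ranked groups=10;
#       var &var; ranks decile;
#     run;
#   %mend;
# In R, the same step is just a function:
add_decile <- function(df, var) {
  df$decile <- cut(rank(-df[[var]]), breaks = 10, labels = FALSE)  # 1 = top decile
  df
}
# scored <- add_decile(scored, "pred")   # hypothetical usage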

                        Taken from Revolution R Presentation on Introduction to Data Mining

Case Study – Current Approach vs. GBM


                GBM shows promise of higher lift and response rate in the top decile of validation data
                (historical direct marketing program) when compared to the current approach
[Chart: Lift (left axis) and KS (right axis) by validation-data decile (1–10), comparing the current approach with GBM; lift and KS values are masked in the chart]
       # one way of building the gbm model in R (t = training data frame, resp = 0/1 target)
       library(gbm)
       mdl <- gbm(resp ~ ., data = t, distribution = "bernoulli",  # logistic loss for 0/1 response
                  n.trees = 1000, interaction.depth = 1,           # 1,000 depth-1 trees (stumps)
                  n.minobsinnode = 20, shrinkage = 0.1,            # learning rate 0.1
                  bag.fraction = 0.5, train.fraction = 1,          # 50% subsampling: the "stochastic" part
                  keep.data = TRUE, verbose = TRUE,
                  cv.folds = 3)                                    # 3-fold cross-validation
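
A natural follow-up, sketched under the assumption that `mdl` is the fit above and `v` is a
hypothetical validation data frame with the same columns: pick the tree count by
cross-validation, score the validation data, and compute decile lift.

       best <- gbm.perf(mdl, method = "cv")               # iteration minimizing CV error
       p    <- predict(mdl, newdata = v, n.trees = best,  # score validation data
                       type = "response")                 # predicted probability of resp = 1

       dec  <- cut(rank(-p), breaks = 10, labels = FALSE)  # decile 1 = highest predicted prob.
       lift <- tapply(v$resp, dec, mean) / mean(v$resp)    # response rate per decile vs. overall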


                                        http://cran.r-project.org/web/packages/gbm/index.html
My Journey from SAS to R




