Predictive Analytics
Advanced Techniques in Data Mining

Sara Venturina



                      Copyright © 2011, SAS Institute Inc. All rights reserved.
Agenda
• What is predictive analytics?

• Predictive Analytics Process

• Data Preparation techniques

• Modeling Techniques

• Model Monitoring techniques




What is Predictive Analytics?
Different levels of analytics


Levels of analytics, from lowest to highest (staircase figure):
• Standard reports
• Ad hoc reports
• Query drilldown (or OLAP)
• Alerts
• Statistical analysis
• Forecasting
• Predictive modeling
• Optimization




What is Predictive Analytics?
Unfortunately, there is no “magic” involved!

• Use of data from different source tables
• Utilizing various data transformation techniques
• Employing statistical theories as foundation
• Will need software to manage this



Business/commercial analytics (as opposed to research analytics) is
 trickier, as you need to balance the theory with realistic application


Predictive Analytics Process


[Cycle diagram: Predictive Analytics Process]
Defining Objectives → Data Preparation → Modeling → Deployment → Model Monitoring → back to Defining Objectives




Data Preparation Techniques
• Possible data sources
• Data transformation techniques
• Deriving “behavioral” information
• Data quality check before modeling




Data Preparation Techniques
Possible data sources
• Data warehouse / data marts
• Operational systems, e.g. transaction systems, billing, call center data, etc.
• External data, e.g. survey data, campaign data, data from external agencies, etc.

For external data, make sure the information is consistently available




Data Preparation Techniques
Data transformation techniques
• Entity-level information
• Indicator variables
   • Are values skewed towards 1 level?

• Categorization/grouping of values
   • Are there too many levels of values?
   • Are there values that rarely occur?

• Binning of continuous variables
• Benchmarking information, e.g. industry benchmarks
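
The transformations listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the 15% rarity threshold, the "OTHER" bucket label and the age bands are all hypothetical and do not come from the slides.

```python
from collections import Counter

def indicator(values, level):
    """Return 1/0 flags marking whether each value equals `level`."""
    return [1 if v == level else 0 for v in values]

def group_rare_levels(values, min_share=0.05, other="OTHER"):
    """Collapse levels whose share of rows is below `min_share` into one bucket."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_share else other for v in values]

def bin_continuous(values, edges, labels):
    """Assign each value to the first bin whose upper edge it does not exceed.

    `labels` must have one more entry than `edges`; the last label is used
    for values above the final edge.
    """
    binned = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                binned.append(label)
                break
        else:
            binned.append(labels[-1])
    return binned

# Hypothetical example data: a plan code and a customer age.
plans = ["A", "A", "B", "A", "C", "A", "A", "B", "A", "A"]
grouped_plans = group_rare_levels(plans, min_share=0.15)  # "C" is rare here
age_bands = bin_continuous([5, 25, 80], edges=[18, 60],
                           labels=["young", "adult", "senior"])
```

Checking the share of each level before and after grouping is one way to answer the "skewed towards one level" and "values that rarely occur" questions on this slide.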

Data Preparation Techniques
Deriving “behavioral” information using several time periods
• Average behavior over the last X time periods
• Measures of variation
   • Standard deviation
   • Coefficient of Variation
   • Deviation from the Mean

• Measures of trend information
   • Ratio of 1 vs 3, 3 vs 6 time periods
   • Proportion of Current vs Average of last X time periods
   • Slope of regression line
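
As a rough sketch of how the measures above might be derived for one customer, the Python below works on a list of per-period values. Several choices here are assumptions, not taken from the slides: the population standard deviation is used, "1 vs 3" and "3 vs 6" are read as the average of the most recent 1 (or 3) periods divided by the average of the most recent 3 (or 6), and the slope is an ordinary least-squares slope against the period index.

```python
from statistics import mean, pstdev

def behavioral_features(history):
    """Derive behavioral measures from per-period values (oldest first)."""
    avg = mean(history)
    sd = pstdev(history)          # population standard deviation (assumption)
    current = history[-1]
    last3 = mean(history[-3:])
    last6 = mean(history[-6:])

    # Least-squares slope of value against period index 0..n-1.
    n = len(history)
    x_bar = (n - 1) / 2
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - avg) for x, y in enumerate(history))

    return {
        "average": avg,
        "std_dev": sd,
        "coef_variation": sd / avg if avg else None,
        "deviation_from_mean": current - avg,
        "ratio_1_vs_3": current / last3 if last3 else None,
        "ratio_3_vs_6": last3 / last6 if last6 else None,
        "current_vs_average": current / avg if avg else None,
        "slope": sxy / sxx if sxx else 0.0,
    }

# Hypothetical six months of usage for one customer, oldest first.
features = behavioral_features([10, 20, 30, 40, 50, 60])
```

A positive slope together with ratios above 1 would suggest growing usage; how such values are interpreted depends on the business question.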



Data Preparation Techniques
Data quality check before modeling
• Generation of summary statistics of derived variables
• Random checking
• Correct imputation of missing values
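
A minimal sketch of the first and last checks above, in plain Python. It assumes missing values are represented as `None` and uses median imputation purely as an example; the slides do not say which imputation method is appropriate, and the variable name and values are hypothetical.

```python
from statistics import mean, median

def summarize(values):
    """Summary statistics for one derived variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }

def impute_median(values):
    """Replace missing values with the median of the observed values."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

# Hypothetical derived variable: tenure in months, with two missing values
# and one suspiciously large value that the summary should surface.
tenure = [12, None, 30, 18, None, 240]
stats = summarize(tenure)
tenure_filled = impute_median(tenure)
```

Comparing `min`, `max` and `n_missing` against what the business expects is a quick way to spot derivation errors before modeling.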




Modeling Techniques
• Use of SAS Enterprise Miner
• Ensemble modeling outside of SAS
• Base SAS modeling, e.g. for categorical targets, survival analysis, etc.




Modeling Techniques
Use of SAS Enterprise Miner




For initial/basic modeling, use Decision Tree or Regression.
Neural networks can be used to provide diagnostic insights.
Modeling Techniques
Ensemble modeling in and out of SAS EM
Ensemble models based on the following models:

Model     Type         Weightage
Model 1   Decision     0.4
Model 2   Regression   0.6
Model 3   Regression   0.4
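
One way to combine model scores outside of SAS EM is a weighted average, sketched below in Python. The weights are the ones shown in the table; since they add up to 1.4 as printed, this sketch rescales them to sum to 1, which is an assumption rather than something stated on the slide. The model keys and the score values are hypothetical.

```python
def weighted_ensemble(scores_by_model, weights):
    """Weighted average of per-model scores; weights are rescaled to sum to 1."""
    total = sum(weights.values())
    n_rows = len(next(iter(scores_by_model.values())))
    return [
        sum(weights[m] * scores_by_model[m][i] for m in weights) / total
        for i in range(n_rows)
    ]

# Weights from the table above; scores are made-up predictions for two customers.
weights = {"model_1": 0.4, "model_2": 0.6, "model_3": 0.4}
scores = {
    "model_1": [0.2, 0.9],
    "model_2": [0.4, 0.7],
    "model_3": [0.3, 0.8],
}
ensemble_scores = weighted_ensemble(scores, weights)
```

Because the weights are rescaled, each ensemble score stays between the lowest and highest individual model score for that customer.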




Modeling Techniques
Base SAS modeling
• Categorical data modeling, e.g.:
    • PROC CATMOD/GENMOD
    • PROC SURVEYLOGISTIC
• Survival analysis:
    • PROC LIFEREG
    • PROC LIFETEST
    • PROC PHREG

Base SAS modeling requires more familiarity with the underlying statistical concepts
Model Monitoring Techniques
• Comparing actual vs predicted
• Scored base analysis:
   • Variable distribution analysis
   • Predicted Score distribution




Model Monitoring
Monitoring of model assessment charts, e.g.:
• A chart that compares the effectiveness of running a model versus selecting randomly
• A chart that measures what percentage of all churners are in the scoring list (e.g. the top 10% of scores captured 40% of actual churners)

Other model assessment statistics can be computed, such as hit rate, Gini coefficient, etc.
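
The quantities behind these charts can be sketched in plain Python once actual outcomes are known for a scored list. The function names and the ten-customer example are hypothetical. The Gini coefficient is computed here as 2 × AUC − 1, with AUC obtained by comparing every churner's score against every non-churner's score.

```python
def top_slice(actual, scores, top_share):
    """Actual outcomes (1 = churned) for the highest-scoring `top_share` of rows."""
    ranked = sorted(zip(scores, actual), key=lambda p: p[0], reverse=True)
    k = max(1, round(len(ranked) * top_share))
    return [a for _, a in ranked[:k]]

def captured_response(actual, scores, top_share=0.10):
    """Share of all actual churners found in the top slice of scores."""
    return sum(top_slice(actual, scores, top_share)) / sum(actual)

def lift(actual, scores, top_share=0.10):
    """Churn rate in the top slice divided by the overall churn rate."""
    top = top_slice(actual, scores, top_share)
    return (sum(top) / len(top)) / (sum(actual) / len(actual))

def gini(actual, scores):
    """Gini coefficient as 2 * AUC - 1, with ties counted as half a win."""
    pos = [s for s, a in zip(scores, actual) if a == 1]
    neg = [s for s, a in zip(scores, actual) if a == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return 2 * wins / (len(pos) * len(neg)) - 1

# Hypothetical scored list of ten customers, three of whom actually churned.
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
captured_top20 = captured_response(actual, scores, top_share=0.20)
lift_top20 = lift(actual, scores, top_share=0.20)
model_gini = gini(actual, scores)
```

Tracking these values for each scoring cycle, and comparing them with the values seen at model development, is one way to notice when a model starts to degrade.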
Model Monitoring (cont’d)
Scored base analysis, e.g.:
• Variable distribution analysis




Model Monitoring (cont’d)
Scored base analysis, e.g.:
• Predicted Score distribution
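
One possible way to compare the predicted score distribution of the current scored base against the development sample is sketched below. The population stability index (PSI) used here is a common choice for this kind of comparison but is not named on the slides, so treat it as a swapped-in illustration. The band edges and score values are hypothetical.

```python
import math

def band_shares(scores, edges):
    """Share of scores falling in each band defined by ascending `edges`."""
    counts = [0] * (len(edges) + 1)
    for s in scores:
        idx = sum(s > e for e in edges)  # number of edges the score exceeds
        counts[idx] += 1
    return [c / len(scores) for c in counts]

def population_stability_index(expected, actual, floor=1e-6):
    """PSI = sum((a - e) * ln(a / e)) over bands; empty bands are floored."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)
        total += (a - e) * math.log(a / e)
    return total

# Hypothetical scores at model development and in the latest scoring run.
edges = [0.33, 0.66]
dev_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9, 0.4, 0.8]
new_scores = [0.1, 0.2, 0.3, 0.2, 0.1, 0.9, 0.4, 0.3]

dev_shares = band_shares(dev_scores, edges)
new_shares = band_shares(new_scores, edges)
psi = population_stability_index(dev_shares, new_shares)
```

A PSI of zero means the two distributions match band for band; larger values indicate a bigger shift. The same comparison could be applied to the distribution of an input variable.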




Predictive Analytics as an Iterative Process


[Cycle diagram: Predictive Analytics Process]
Defining Objectives → Data Preparation → Modeling → Deployment → Model Monitoring → back to Defining Objectives




Questions?





Editor's Notes

  • #6 Focus on the three main aspects where the data miner is key: data prep, modeling and model monitoring
  • #9 Unfortunately, data from the DW as it is may not always be the best info to use
  • #12 Using SEMMA methodology
  • #13 Basic modeling = use SAS EM: it requires less stat knowledge and offers a lot of modeling algorithms. Stick to simple models to be able to explain them easily
  • #14 SAS EM ensemble: voting, averaging
  • #15 Categorical data modeling: The CATMOD procedure treats all explanatory (independent) variables as classification variables by default, and you specify continuous covariates in the DIRECT statement. The other procedures treat covariates as continuous by default, and you specify the classification variables in the CLASS statement.

    The CATMOD procedure provides weighted least squares estimation of many response functions, such as means, cumulative logits, and proportions, and you can also compute and analyze other response functions that can be formed from the proportions corresponding to the rows of a contingency table. In addition, a user can input and analyze a set of response functions and user-supplied covariance matrix with weighted least squares. PROC CATMOD also provides maximum likelihood estimation for binary and polytomous logistic regression.

    The GENMOD procedure is also a general statistical modeling tool which fits generalized linear models to data; it fits several useful models to categorical data including logistic regression, the proportional odds model, and Poisson and negative binomial regression for count data. The GENMOD procedure also provides a facility for fitting generalized estimating equations to correlated response data that are categorical, such as repeated dichotomous outcomes. The GENMOD procedure fits models using maximum likelihood estimation. PROC GENMOD can perform Type I and Type III tests, and it provides predicted values and residuals. Bayesian analysis capabilities for generalized linear models are also available.

    The GLIMMIX procedure fits many of the same models as the GENMOD procedure but also allows the inclusion of random effects. The GLIMMIX procedure fits models using maximum likelihood estimation.

    The LOGISTIC procedure is specifically designed for logistic regression. It performs the usual logistic regression analysis for dichotomous outcomes and it fits the proportional odds model and the generalized logit model for ordinal and nominal outcomes, respectively, by the method of maximum likelihood. This procedure has capabilities for a variety of model-building techniques, including stepwise, forward, and backward selection. It computes predicted values, the receiver operating characteristic (ROC) curve and the area beneath the curve, and a number of regression diagnostics. It can create output data sets containing these values and other statistics. PROC LOGISTIC can perform a conditional logistic regression analysis (matched-set and case-controlled) for binary response data. For small data sets, PROC LOGISTIC can perform exact conditional logistic regression. Firth’s bias-reducing penalized-likelihood method can also be used in place of conditional and exact conditional logistic regression.

    The PROBIT procedure is designed for quantal assay or other discrete event data. In addition to performing the logistic regression analysis, it can estimate the threshold response rate. PROC PROBIT can also estimate the values of independent variables that yield a desired response.

    The SURVEYLOGISTIC procedure performs logistic regression for binary, ordinal, and nominal responses under a specified complex sampling scheme, instead of the usual stratified simple random sampling.

    Survival analysis: There are three SAS procedures for analyzing survival data: LIFEREG, LIFETEST, and PHREG. PROC LIFEREG is a parametric regression procedure for modeling the distribution of survival time with a set of concomitant variables. PROC LIFETEST is a nonparametric procedure for estimating the survivor function, comparing the underlying survival curves of two or more samples, and testing the association of survival time with other variables. PROC PHREG is a semiparametric procedure that fits the Cox proportional hazards model and its extensions.
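
    To make "nonparametric estimate of the survivor function" more concrete, here is a small Python sketch of the product-limit (Kaplan-Meier) estimator, a standard nonparametric estimator of this kind. It illustrates the idea only and is not a reproduction of any SAS procedure's output; the function name and the five-customer example are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Product-limit estimate of the survivor function.

    durations: time until the event or until censoring, one per subject.
    observed:  1 if the event occurred at that time, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, o in zip(durations, observed) if d == t and o == 1)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Multiply by the share of at-risk subjects surviving past time t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # events and censored subjects both leave the risk set
    return curve

# Hypothetical tenure in months for five customers; 1 = churned, 0 = still active.
durations = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]
km_curve = kaplan_meier(durations, observed)
```

    Subjects censored at a given time are treated as still at risk for events at that same time, which is the usual convention.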