# Predictive analytics

SAS Business Analytics 2011 - Predictive Analytics

Published in: Technology

• Focus on the three main areas where the data miner is key: data preparation, modeling, and model monitoring
• Unfortunately, data taken from the data warehouse as-is may not always be the best information to use
• Using the SEMMA (Sample, Explore, Modify, Model, Assess) methodology
• Basic modeling: use SAS Enterprise Miner (SAS EM), which requires less statistical knowledge and offers many modeling algorithms. Stick to simple models so that they are easy to explain
• SAS EM ensemble: voting, averaging
• Categorical data modeling:
  • The CATMOD procedure treats all explanatory (independent) variables as classification variables by default, and you specify continuous covariates in the DIRECT statement. The other procedures treat covariates as continuous by default, and you specify the classification variables in the CLASS statement. The CATMOD procedure provides weighted least squares estimation of many response functions, such as means, cumulative logits, and proportions, and you can also compute and analyze other response functions that can be formed from the proportions corresponding to the rows of a contingency table. In addition, a user can input and analyze a set of response functions and a user-supplied covariance matrix with weighted least squares. PROC CATMOD also provides maximum likelihood estimation for binary and polytomous logistic regression.
  • The GENMOD procedure is also a general statistical modeling tool which fits generalized linear models to data; it fits several useful models to categorical data including logistic regression, the proportional odds model, and Poisson and negative binomial regression for count data. The GENMOD procedure also provides a facility for fitting generalized estimating equations to correlated response data that are categorical, such as repeated dichotomous outcomes. The GENMOD procedure fits models using maximum likelihood estimation. PROC GENMOD can perform Type I and Type III tests, and it provides predicted values and residuals. Bayesian analysis capabilities for generalized linear models are also available.
  • The GLIMMIX procedure fits many of the same models as the GENMOD procedure but also allows the inclusion of random effects. The GLIMMIX procedure fits models using maximum likelihood estimation.
  • The LOGISTIC procedure is specifically designed for logistic regression. It performs the usual logistic regression analysis for dichotomous outcomes and it fits the proportional odds model and the generalized logit model for ordinal and nominal outcomes, respectively, by the method of maximum likelihood. This procedure has capabilities for a variety of model-building techniques, including stepwise, forward, and backward selection. It computes predicted values, the receiver operating characteristic (ROC) curve and the area beneath the curve, and a number of regression diagnostics. It can create output data sets containing these values and other statistics. PROC LOGISTIC can perform a conditional logistic regression analysis (matched-set and case-controlled) for binary response data. For small data sets, PROC LOGISTIC can perform exact conditional logistic regression. Firth’s bias-reducing penalized-likelihood method can also be used in place of conditional and exact conditional logistic regression.
  • The PROBIT procedure is designed for quantal assay or other discrete event data. In addition to performing the logistic regression analysis, it can estimate the threshold response rate. PROC PROBIT can also estimate the values of independent variables that yield a desired response.
  • The SURVEYLOGISTIC procedure performs logistic regression for binary, ordinal, and nominal responses under a specified complex sampling scheme, instead of the usual stratified simple random sampling.
• Survival analysis: There are three SAS procedures for analyzing survival data: LIFEREG, LIFETEST, and PHREG.
  • PROC LIFEREG is a parametric regression procedure for modeling the distribution of survival time with a set of concomitant variables.
  • PROC LIFETEST is a nonparametric procedure for estimating the survivor function, comparing the underlying survival curves of two or more samples, and testing the association of survival time with other variables.
  • PROC PHREG is a semiparametric procedure that fits the Cox proportional hazards model and its extensions.
### Predictive analytics

2. Agenda
• What is predictive analytics?
• Predictive Analytics Process
• Data Preparation techniques
• Modeling Techniques
• Model Monitoring techniques
Copyright © 2011, SAS Institute Inc. All rights reserved.

4. What is Predictive Analytics?
Unfortunately, there is no “magic” involved!
• Use of data from different source tables
• Utilizing various data transformation techniques
• Employing statistical theories as foundation
• Will need software to manage this
Focusing on business/commercial (as opposed to research) analytics is trickier, as you need to balance the theories with realistic application

6. Data Preparation Techniques
• Possible data sources
• Data transformation techniques
• Deriving “behavioral” information
• Data quality check before modeling

7. Data Preparation Techniques
Possible data sources
• Data warehouse / data marts
• Operational systems, e.g. transaction systems, billing, call center data, etc.
• External data, e.g. survey data, campaign data, data from external agencies, etc.
For external data, make sure the information is consistently available

8. Data Preparation Techniques
Data transformation techniques
• Entity-level information
• Indicator variables
  • Are values skewed towards one level?
• Categorization/grouping of values
  • Are there too many levels of values?
  • Are there values that rarely occur?
• Binning of continuous variables
• Benchmarking information, e.g. industry benchmarking
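
The grouping and binning checks above can be sketched in a few lines. This is only an illustration in plain Python, not the SAS workflow used in the deck; the function names, the 5% rarity cutoff, the bin edges, and the sample data are all assumptions made for the example.

```python
from collections import Counter


def group_rare_levels(values, min_share=0.05, other_label="OTHER"):
    """Collapse levels whose share of rows is below min_share into one level."""
    counts = Counter(values)
    total = len(values)
    keep = {level for level, n in counts.items() if n / total >= min_share}
    return [v if v in keep else other_label for v in values]


def bin_continuous(value, edges, labels):
    """Return the label of the first bin whose upper edge is >= value.

    edges must be sorted ascending; labels needs one more entry than edges.
    """
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]  # above the last edge: final bin


def indicator_share(flags):
    """Share of rows where a 0/1 indicator equals 1 (checks for skew)."""
    return sum(flags) / len(flags)


# Hypothetical sample: a plan code with two rarely occurring levels.
plans = ["A"] * 50 + ["B"] * 45 + ["C"] * 3 + ["D"] * 2
grouped = Counter(group_rare_levels(plans))

# Hypothetical age bins: up to 25, 26 to 50, above 50.
age_band = bin_continuous(30, edges=[25, 50], labels=["low", "mid", "high"])

# An indicator that is 1 for only 2% of rows is heavily skewed to one level.
skew = indicator_share([1] * 2 + [0] * 98)
```

An indicator with a share very close to 0 or 1 carries little information, which is the reason for the skew check on the slide.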

9. Data Preparation Techniques
Deriving “behavioral” information using several time periods
• Average behavior over the last X time periods
• Measures of variation
  • Standard deviation
  • Coefficient of Variation
  • Deviation from the Mean
• Measures of trend information
  • Ratio of 1 vs 3, 3 vs 6 time periods
  • Proportion of Current vs Average of last X time periods
  • Slope of regression line
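
As a rough sketch of how these behavioral variables might be derived for one customer, the Python below computes each measure from a list of period values. The function name and sample data are hypothetical, and two interpretations are assumed: "1 vs 3" is read as the latest period divided by the average of the latest three periods, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def behavioral_features(history):
    """Derive variation and trend measures from one customer's history.

    history: values per time period, oldest first, most recent last.
    Assumes at least six periods and a non-zero average.
    """
    avg = mean(history)
    sd = pstdev(history)          # population standard deviation (assumption)
    current = history[-1]
    avg_last3 = mean(history[-3:])
    avg_last6 = mean(history[-6:])

    # Slope of the least-squares line of value on period index 0..n-1.
    xs = list(range(len(history)))
    x_bar = mean(xs)
    slope = (
        sum((x - x_bar) * (y - avg) for x, y in zip(xs, history))
        / sum((x - x_bar) ** 2 for x in xs)
    )

    return {
        "average": avg,
        "std_dev": sd,
        "coef_of_variation": sd / avg,
        "deviation_from_mean": current - avg,
        "ratio_1_vs_3": current / avg_last3,
        "ratio_3_vs_6": avg_last3 / avg_last6,
        "current_vs_average": current / avg,
        "slope": slope,
    }


# Hypothetical monthly usage that grows by 10 units per period.
features = behavioral_features([100, 110, 120, 130, 140, 150])
```

Ratios above 1 and a positive slope both indicate rising behavior; the ratios react to recent changes, while the slope summarizes the whole window.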

10. Data Preparation Techniques
Data quality check before modeling
• Generation of summary statistics of derived variables
• Random checking
• Correct imputation of missing values
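
A minimal sketch of such a check follows, assuming missing values are represented as `None` and that median imputation is acceptable for the variable in question. The slide does not prescribe an imputation method, so the median is only one possible choice, and the names and data are hypothetical.

```python
from statistics import mean, median


def summarize(values):
    """Summary statistics for one derived variable, ignoring missing values."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "n_missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
    }


def impute_median(values):
    """Replace missing values with the median of the observed values."""
    fill = median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


# Hypothetical derived variable: customer tenure in months.
tenure = [12, None, 7, 30, None, 9]
before = summarize(tenure)
tenure_imputed = impute_median(tenure)
after = summarize(tenure_imputed)
```

Comparing the summaries before and after imputation is a quick way to confirm that the fill value did not distort the range or the center of the variable.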

11. Modeling Techniques
• Use of SAS Enterprise Miner
• Ensemble modeling outside of SAS
• Base SAS modeling, e.g. for categorical target, survival analysis, etc.

12. Modeling Techniques
Use of SAS Enterprise Miner
For initial/basic modeling, use Decision Tree, Regression.
Neural networks can be used to provide diagnostic insights

13. Modeling Techniques
Ensemble modeling in and out of SAS EM
Ensemble based on the following models:

| Models | Weightage |
| --- | --- |
| Model 1: Decision | 0.4 |
| Model 2: Regression | 0.6 |
| Model 3: Regression | 0.4 |
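
Outside of SAS EM, the two ensemble styles mentioned in the notes (averaging and voting) could be sketched as below. The weights are the ones shown on the slide; because they sum to 1.4, this sketch normalizes by the weight total, which is an assumption about how they are meant to be applied. The per-model scores and the 0.5 vote threshold are made up for illustration.

```python
def weighted_average(scores, weights):
    """Combine per-model scores into one score, normalizing by total weight."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def majority_vote(scores, threshold=0.5):
    """True when more than half of the models score at or above threshold."""
    votes = sum(1 for s in scores if s >= threshold)
    return votes * 2 > len(scores)


# Weights from the slide: Model 1, Model 2, Model 3.
weights = [0.4, 0.6, 0.4]

# Hypothetical predicted probabilities for one customer from the three models.
scores = [0.8, 0.3, 0.6]

ensemble_score = weighted_average(scores, weights)
ensemble_vote = majority_vote(scores)
```

Averaging keeps a continuous score that can still be ranked, while voting returns only a yes/no decision, so averaging is usually the more convenient form when a scoring list is needed.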

14. Modeling Techniques
Base SAS modeling
• Categorical data modeling, e.g.
  • PROC CATMOD/GENMOD
  • PROC SURVEYLOGISTIC
• Survival analysis:
  • PROC LIFEREG
  • PROC LIFETEST
  • PROC PHREG
Base SAS modeling requires more familiarity with underlying statistical concepts

15. Model Monitoring Techniques
• Comparing actual vs predicted
• Scored base analysis:
  • Variable distribution analysis
  • Predicted score distribution
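
One common way to compare a score (or variable) distribution between the model development sample and the current scored base is the population stability index (PSI). The deck does not name a specific statistic, so PSI is an assumption here, as are the function names, bin edges, and sample scores in this sketch.

```python
import math


def distribution(values, edges):
    """Share of values per bin: (-inf, e0], (e0, e1], ..., (e_last, inf).

    edges must be sorted ascending.
    """
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v > e)  # number of edges below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected, actual, floor=1e-6):
    """Population stability index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) for empty bins
        total += (a - e) * math.log(a / e)
    return total


edges = [0.2, 0.4, 0.6, 0.8]

# Hypothetical scores at development time: evenly spread over five bins.
dev_scores = [0.1, 0.3, 0.5, 0.7, 0.9] * 20

# Hypothetical current scores: fewer low scores, more high scores.
cur_scores = [0.1] * 10 + [0.3] * 20 + [0.5] * 20 + [0.7] * 20 + [0.9] * 30

dev_dist = distribution(dev_scores, edges)
cur_dist = distribution(cur_scores, edges)
score_psi = psi(dev_dist, cur_dist)
```

A PSI of zero means the two distributions are identical across the bins; larger values indicate a bigger shift in the scored base and are a prompt to investigate the input variables.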

16. Model Monitoring
Monitoring of model assessment charts, e.g.
• Measures what percentage of all churners are in the scoring list (e.g. the top 10% of scores captured 40% of actual churners)
• Compares the effectiveness of running a model versus selecting randomly
Other model assessment statistics can be computed, such as hit rate, Gini coefficient, etc.
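
The statistics named on this slide can be computed directly from a scored list once actual outcomes are known. The sketch below is illustrative only: the function names and the 20-customer sample are hypothetical, and the Gini coefficient is taken as 2 × AUC − 1, the usual definition for scoring models.

```python
def capture_stats(scores, actuals, top_share=0.10):
    """Captured response, hit rate, and lift for the top-scored share."""
    ranked = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    n_top = max(1, round(len(ranked) * top_share))
    hits = sum(a for _, a in ranked[:n_top])
    total_events = sum(actuals)
    hit_rate = hits / n_top                     # churners among those selected
    base_rate = total_events / len(actuals)     # churners if picked at random
    return {
        "captured": hits / total_events,        # share of all churners found
        "hit_rate": hit_rate,
        "lift": hit_rate / base_rate,
    }


def gini(scores, actuals):
    """Gini coefficient as 2 * AUC - 1, with AUC from pairwise comparisons."""
    pos = [s for s, a in zip(scores, actuals) if a == 1]
    neg = [s for s, a in zip(scores, actuals) if a == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return 2 * wins / (len(pos) * len(neg)) - 1


# Hypothetical scored base: 20 customers, scores 1.00 down to 0.05,
# with 1 marking customers who actually churned.
scores = [i / 20 for i in range(20, 0, -1)]
actuals = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

stats = capture_stats(scores, actuals, top_share=0.10)
model_gini = gini(scores, actuals)
```

In this sample the top 10% of scores contains 2 of the 5 churners, so the captured response is 40%, and the lift of 4 says the model finds churners four times as often as random selection would.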