Copyright © 2011 Clarity Solution Group 
Predictive Modeling with Enterprise Miner 
Jeffrey Strickland, Ph.D. 
Senior Consultant
Copyright © 2012 Clarity Solution Group 
Learning Objectives 
• To understand the application of regression analysis in data mining
  • Linear/nonlinear
  • Logistic (logit)
• To understand the key statistical measures of fit
• To learn how to run and interpret regression analyses using SAS Enterprise Miner software
SAS Enterprise Miner
• These results can be obtained using Excel or a data mining package such as SAS Enterprise Miner 5.3
• Using SAS Enterprise Miner requires the following steps:
  • Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1
  • Create a project in Enterprise Miner
  • Within the project:
    • Create a data source using your SAS data file
    • Create a diagram that includes a data node, a regression node, and a Multiplot node for graphs
    • Run the model in the diagram and review the results
Creating a SAS data file from an Excel file: open SAS 9.1. Select File, then Import Data.
This opens the import wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file.
Since the data are on sheet1$, click Next. Then enter SASUSER as the Library and TEMPKILOWATTL as the Member. Then click Next.
Now click Finish to create your file
Open SAS Enterprise Miner 5.3. Enter the user name and password provided
The Enterprise Window below opens. Select New Project
The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g., KWattTemp0. Keep the default path.
In the Startup Code tab, enter:
libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects";
This code runs each time you open the project.
The Enterprise Miner application window opens
Right-click on Data Sources to open the wizard. The source is a SAS table, so click Next.
Browse the SAS libraries to find the SAS table Tempkilowattl, found in the SASuser library (previously created).
Click Next twice. Note that the table properties show two variables and 12 observations.
The next step controls how Enterprise Miner organizes metadata for the variables in your data. Select Advanced, then click Next (you can view or change the settings if you click Customize before clicking Next).
Change the Role of KWatts to Target (the outcome variable); change the Level of both KWatts and Temp to Interval (continuous values); then click Next. (Other levels are possible, such as Binary.) You can click Explore if you wish to look at some basic statistics; we will do this later.
Here Role relates to the role of the data set (raw, train, validate, score); raw is fine for our analysis of data, so click Finish
Tempkilowattl now appears under Data Sources in the top-left panel, called the Project Panel.
We need to create a Diagram for our model. Right-click on Diagrams, then enter TempKwatts0 in the dialog box. Now the left panel shows TempKwatts0 as a Diagram, and the right-hand panel is called the Diagram Workspace. Icons can be dragged and dropped onto the Diagram Workspace.
Now add an Input Data node to the Diagram. From the Data Sources list in the Project Panel, drag and drop the data source TempKwatts0 onto the Diagram Workspace. Note that when the Input Data node is highlighted, various properties are displayed on the left-hand panel.
If you wish to see the properties of any or all of the variables, highlight the Input Data node; then on the left-hand Properties Panel under Train, click on the box to the right of Variables; in the screen that opens, control-click on KWatts and Temp; then click Explore in the lower right.
Frequency distributions for the variables and the raw data are provided. Right-clicking on observations in the lower-left panel will show where they appear in the bar charts. Cancel when finished.
Click on the Explore tab found over the Diagram Workspace, then drag and drop the Multiplot icon onto the field. Using your cursor, draw a directed arrow from the TempKwattsl icon to the Multiplot icon. With the Multiplot icon highlighted, its properties are found in the left-hand Properties Panel.
Right-click on the Multiplot icon and select Run. After the run is completed, select Results from the Run Status window.
Various charts are available as shown below. Descriptive statistics for each variable are given in the lower pane.
Click on the Model tab and drag the Regression icon onto the Model field. Connect the Tempkwattsl icon to the Regression icon. Highlight the Regression icon and, on the Properties Panel, change Regression Type to Linear Regression.
Run the Regression node and select Results. Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, the model output (estimates, F and t statistics, R-square), and the two effects (intercept and slope; bar height represents size and color represents direction).
For a given percentile, the Target Mean is the actual value (or an estimate based on actuals), that is, what you are trying to predict; the Mean for Predicted is the forecasted value (or an estimate based on forecasts). The results are shown from highest to lowest forecasted values. The distance between the curves shows how well the model predicts the actual data.
A variety of fit statistics are provided. These include SSE, MSE = SSE/(n-p-1) (that is, SSE/(n-2) here), ASE = SSE/n, RMSE = SQRT(MSE), RASE = SQRT(ASE), FPE = MSE(n+p+1)/n, and MAX = largest error in absolute value, where n = number of observations and p = number of variables in the model (one in our case). Schwarz's Bayesian Criterion and Akaike's Information Criterion are used for model selection (comparing one model to another). Schwarz's criterion adjusts the residual squared error for the number of parameters estimated, while Akaike's is a relative measure of the information lost from fitting the model.
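The fit statistics listed on this slide can be computed directly from the residuals. The sketch below is a minimal illustration, assuming a simple linear regression with one input (p = 1) and taking RASE as the square root of ASE. The temperature and kilowatt values are made up for illustration (they are not the TempKWatts data), and the function names are hypothetical.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def fit_statistics(y, y_hat, p=1):
    """Fit statistics as defined on the slide; p = number of input variables."""
    n = len(y)
    errors = [yi - fi for yi, fi in zip(y, y_hat)]
    sse = sum(e * e for e in errors)
    mse = sse / (n - p - 1)  # equals SSE/(n-2) when p = 1
    ase = sse / n
    return {
        "SSE": sse,
        "MSE": mse,
        "ASE": ase,
        "RMSE": math.sqrt(mse),
        "RASE": math.sqrt(ase),
        "FPE": mse * (n + p + 1) / n,
        "MAX": max(abs(e) for e in errors),
    }

# Hypothetical data, for illustration only (not the TempKWatts file)
temp = [60, 65, 70, 75, 80, 85, 90, 95]
kwatts = [8.1, 8.9, 10.2, 11.0, 12.4, 13.1, 14.6, 15.2]

a, b = fit_simple_linear(temp, kwatts)
predicted = [a + b * t for t in temp]
stats = fit_statistics(kwatts, predicted, p=1)
for name, value in stats.items():
    print(f"{name:5s} {value:.4f}")
```

Because MSE divides by n-2 while ASE divides by n, RMSE is always at least as large as RASE for the same model.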
Kwatts vs. Temp Example 2
• Another approach to modeling the relationship between Kwatts and Temp is to use a nonlinear regression
• This is easily accomplished in Enterprise Miner: highlight the regression node, then in the left-hand panel select Yes for Polynomial Terms
• We use the default of two terms
• Is the fit any better?
Multiple Regression 
Consider the following data relating family size and income to food expenditures:
family  food $  income $  family size
1       5.2     28        3
2       5.1     26        3
3       5.6     32        2
4       4.6     24        1
5       11.3    54        4
6       8.1     59        2
7       7.8     44        3
8       5.8     30        2
9       5.1     40        1
10      18      82        6
11      4.9     42        3
12      11.8    58        4
13      5.2     28        1
14      4.8     20        5
15      7.9     42        3
16      6.4     47        1
17      20      112       6
18      13.7    85        5
19      5.1     31        2
20      2.9     26        2
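For readers who want to check the Enterprise Miner output by hand, the sketch below fits the same kind of model (food expenditure as a linear function of income and family size) by solving the least-squares normal equations in plain Python. The data values are as read from the flattened table on this slide; the function names are hypothetical, and this is not how Enterprise Miner itself computes the estimates.

```python
# Food expenditure data as read from the slide: (food $, income $, family size)
rows = [
    (5.2, 28, 3), (5.1, 26, 3), (5.6, 32, 2), (4.6, 24, 1), (11.3, 54, 4),
    (8.1, 59, 2), (7.8, 44, 3), (5.8, 30, 2), (5.1, 40, 1), (18.0, 82, 6),
    (4.9, 42, 3), (11.8, 58, 4), (5.2, 28, 1), (4.8, 20, 5), (7.9, 42, 3),
    (6.4, 47, 1), (20.0, 112, 6), (13.7, 85, 5), (5.1, 31, 2), (2.9, 26, 2),
]

def solve(a, b):
    """Solve the small linear system a*z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def ols(xs, y):
    """Least-squares estimates [intercept, b1, b2, ...] from the normal equations."""
    design = [[1.0] + list(r) for r in xs]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

food = [r[0] for r in rows]
inputs = [(r[1], r[2]) for r in rows]

coefs = ols(inputs, food)
fitted = [coefs[0] + coefs[1] * inc + coefs[2] * fam for inc, fam in inputs]
residuals = [y - f for y, f in zip(food, fitted)]

sse = sum(e * e for e in residuals)
mean_food = sum(food) / len(food)
sst = sum((y - mean_food) ** 2 for y in food)
r_square = 1 - sse / sst

print("intercept, income slope, family-size slope:", [round(c, 4) for c in coefs])
print("R-square:", round(r_square, 4))
```

The three estimates correspond to the three effects (intercept and two slopes) that Enterprise Miner reports for this model.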
Multiple Regression
• We can run this problem in Enterprise Miner using the same approach followed with the previous example
• On our model field we have placed the data source called foodexpenditures, and also both Multiplot and StatExplore, found under the Explore tab above the model field
• Highlight foodexpenditures, then in the left-hand panel under Train, find Variables and click on the box to the right to open the variables
• Change the role of family to rejected (it is just the number of the observation), change the role of food_ to target, and set the level of income_, food_, and fam_size to interval, then click OK
Foodexpenditures Model
Highlight the StatExplore node, right-click to Run, then select Results. Correlations between the input variables and the target are provided, along with basic statistics. The input variables are ordered by the size of the correlations. Now close the results window, run the Regression node, and obtain the results.
Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, the model output (estimates, F and t statistics, R-square), and the three effects (intercept and slopes for the two input variables; bar height represents size and color represents direction). The model is significant and is a good fit with the data.
What happens in regression analysis when the target variable is binary?
• There are many situations when the target variable is binary. Some examples:
  • whether a customer will or will not receive credit
  • whether a customer will or will not respond to a promotion
  • whether a firm will go bankrupt in a year
  • whether a student will pass an exam
Passing an Exam Data
Student id  Outcome  Study Hours
1           0        3
2           1        34
3           0        17
4           0        6
5           0        12
6           1        15
7           1        26
8           1        29
9           0        14
10          1        58
11          0        2
12          1        31
13          1        26
14          0        11
Running a linear regression to predict pass/don't pass as a function of hours of study produces a model that does not fit the data well. The data are given in exampassing.xls.
[Chart: Passing an Exam. Actual and predicted pass or don't pass (vertical axis) vs. hours of study (horizontal axis)]
The Enterprise Miner results show a poor fit on a percentile basis between predicted and target; another modeling approach is needed.
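One way to see why a straight line is a poor model here is to fit it and look at the predictions. The sketch below is a minimal illustration using the exam data from the table above (values as read from the slide) and ordinary least squares; the variable names are hypothetical. The fitted line is not confined to the 0 to 1 range: its prediction for 58 hours of study is greater than 1, which cannot be read as a probability of passing.

```python
# Exam data from the slide: (study hours, outcome), 1 = pass, 0 = don't pass
exam = [(3, 0), (34, 1), (17, 0), (6, 0), (12, 0), (15, 1), (26, 1),
        (29, 1), (14, 0), (58, 1), (2, 0), (31, 1), (26, 1), (11, 0)]

hours = [h for h, _ in exam]
outcome = [o for _, o in exam]
n = len(exam)

# Ordinary least squares for outcome = intercept + slope * hours
h_bar = sum(hours) / n
o_bar = sum(outcome) / n
slope = (sum((h - h_bar) * (o - o_bar) for h, o in exam)
         / sum((h - h_bar) ** 2 for h in hours))
intercept = o_bar - slope * h_bar

def predict(h):
    """Predicted 'outcome' from the straight-line model."""
    return intercept + slope * h

for h in sorted(set(hours)):
    print(f"{h:3d} hours -> predicted {predict(h):.3f}")
```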
Logistic Regression
• Similar to linear regression, with two main differences:
  • Y (outcome or response) is categorical
    • Yes/No
    • Approve/Reject
    • Responded/Did not respond
  • The result is expressed as a probability of being in either group
Comparing the Logistic & Linear Regression Models
Logistic regression
p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)]
1-p = 1/[1+exp(a+bx)]
ln[p/(1-p)] = a + bx
where:
exp or e is the exponential function (e = 2.71828…)
ln is the natural logarithm (ln(e) = 1)
p is the probability that the event y occurs given x, and can range between 0 and 1
p/(1-p) is the odds
ln[p/(1-p)] is the log odds, or "logit"
all other components of the regression model are the same
Odds
• Frequently used
• Related to the probability of an event as follows: Odds = p/(1-p)
• Example:
  • Probability of firm going bankrupt = .25
  • Odds firm will go bankrupt = .25/(1-.25) = 1/3, or 1 to 3 (3 to 1 against)
  • This is how sports books calculate odds
  • (e.g., if the odds against VU winning a championship are 2:1, the probability is 1/3)
• ln[p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds increases by b, or the odds increase by a factor of exp(b)
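The relationships among probability, odds, and the logit can be checked numerically. The sketch below is a minimal illustration; the function names and the coefficients a = -1.0 and b = 0.7 are hypothetical and chosen only to show that a one-unit increase in x multiplies the odds by exp(b).

```python
import math

def logistic(z):
    """p = exp(z) / (1 + exp(z)), where z = a + b*x."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

def probability_from_odds(o):
    """Inverse of odds(): p = odds / (1 + odds)."""
    return o / (1.0 + o)

# Bankruptcy example from the slide: p = .25 gives odds of 1/3
print("odds at p = 0.25:", odds(0.25))

# Odds of 2:1 against an event are odds of 1/2 in favor, so p = 1/3
print("probability at odds 1/2:", probability_from_odds(0.5))

# Hypothetical coefficients: raising x by 1 multiplies the odds by exp(b)
a, b = -1.0, 0.7
x = 2.0
ratio = odds(logistic(a + b * (x + 1))) / odds(logistic(a + b * x))
print("odds multiplier:", ratio, "exp(b):", math.exp(b))
```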
Probability, Odds, LN of Odds
[Chart: odds and ln(odds) plotted against probability, for probabilities from 0.05 to 0.95]
Running the exam data: change the regression type from linear regression to logistic regression. Highlight the data node; on the left-hand panel under Train, open Variables and change the level of outcome to binary.
Results show a much better fit (upper left) and only one misclassification (lower right: a false negative).
The results show that the odds = p/(1-p) = exp(-8.4962 + 0.4949x). For every additional hour of study, the odds increase by a factor of exp(0.4949) = 1.640.
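The fitted coefficients can be applied to the exam data to reproduce the classification result. The sketch below is a minimal illustration, assuming a 0.5 probability cutoff for predicting a pass (the cutoff is an assumption, not stated on the slides); the coefficients are the ones reported on this slide, the data are as read from the exam table, and the names are hypothetical.

```python
import math

# Intercept and slope reported on the slide
A, B = -8.4962, 0.4949

# Exam data from the slide: (study hours, outcome), 1 = pass, 0 = don't pass
exam = [(3, 0), (34, 1), (17, 0), (6, 0), (12, 0), (15, 1), (26, 1),
        (29, 1), (14, 0), (58, 1), (2, 0), (31, 1), (26, 1), (11, 0)]

def prob_pass(hours):
    """Predicted probability of passing from the fitted logistic model."""
    return 1.0 / (1.0 + math.exp(-(A + B * hours)))

# Classify with an assumed 0.5 cutoff and collect the errors
misclassified = []
for hours, outcome in exam:
    predicted = 1 if prob_pass(hours) >= 0.5 else 0
    if predicted != outcome:
        misclassified.append((hours, outcome, predicted))

odds_factor = math.exp(B)

print("misclassified (hours, actual, predicted):", misclassified)
print("odds multiplier per extra hour:", round(odds_factor, 3))
```

Under these assumptions the only error is the student who passed after 15 hours of study but is predicted not to pass, which is the single false negative mentioned in the results.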
Understanding Response Rate and Lift
To better understand the top-left chart, change Cumulative Lift to Cumulative % Response. The observations are ranked from highest to lowest by the predicted probability of response from the fitted model.
Understanding Response Rate and Lift
• Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40th percentile.
• At the 50th percentile, the observation with the next-highest predicted probability is a non-response, so the cumulative % response drops to 6/7, or 85.7%.
• The 8th-ranked observation, between the 55th and 60th percentiles, is a positive response, so the cumulative % response is 7/8, or 87.5%.
• Since there are no more positive responses after the 60th percentile, the cumulative % response declines from there to 50% (7/14) at the 100th percentile.
• The chart shows how well the ranked predictions match the actual responses as more of the ranked observations are included.
Understanding Response Rate and Lift
• Lift is the ratio of the actual response rate (passing) among the top n% of the ranked observations to the overall response rate. Cumulative lift is defined likewise.
• At the 50th percentile, the cumulative % response is 85.7% and the cumulative base response is 50%, for a lift of about 1.714.
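The cumulative % response and cumulative lift figures can be reproduced from the exam data. The sketch below is a minimal illustration: it ranks the 14 students by the predicted probability from the logistic model reported earlier (intercept -8.4962, slope 0.4949) and computes both measures at a given depth. The data are as read from the exam table; the function names are hypothetical.

```python
import math

# Intercept and slope of the fitted logistic model reported in the slides
A, B = -8.4962, 0.4949

# Exam data from the slide: (study hours, outcome), 1 = pass, 0 = don't pass
exam = [(3, 0), (34, 1), (17, 0), (6, 0), (12, 0), (15, 1), (26, 1),
        (29, 1), (14, 0), (58, 1), (2, 0), (31, 1), (26, 1), (11, 0)]

def prob_pass(hours):
    return 1.0 / (1.0 + math.exp(-(A + B * hours)))

# Rank observations from highest to lowest predicted probability
ranked = sorted(exam, key=lambda row: prob_pass(row[0]), reverse=True)
base_rate = sum(o for _, o in exam) / len(exam)  # overall response rate

def cumulative_response(depth):
    """Share of actual passes among the top `depth` ranked observations."""
    top = ranked[:depth]
    return sum(o for _, o in top) / depth

def cumulative_lift(depth):
    """Cumulative response rate relative to the overall response rate."""
    return cumulative_response(depth) / base_rate

for depth in (6, 7, 8, 14):
    pct = 100 * depth / len(exam)
    print(f"top {depth:2d} ({pct:5.1f}%): "
          f"response {cumulative_response(depth):.3f}, "
          f"lift {cumulative_lift(depth):.3f}")
```

At a depth of 7 observations (the 50th percentile) this gives a cumulative response of 6/7 and a lift of 12/7, matching the values discussed on the slides.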
On the Properties Panel, click on Exported Data to see the predicted probabilities and response for each observation and compare them to the actual response.
Logistic regression uses maximum likelihood (and not sum of squared errors) to estimate the model parameters. The results below show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether an effect is significant or not.
Bankruptcy Prediction
• To predict bankruptcy a year in advance, you might collect:
  • working capital/total assets (WC/TA)
  • retained earnings/total assets (RE/TA)
  • earnings before interest and taxes/total assets (EBIT/TA)
  • market value of equity/total debt (MVE/TD)
  • sales/total assets (S/TA)
Bankruptcy Training Data
Firm  WC/TA   RE/TA    EBIT/TA  MVE/TD  S/TA    BR/NB
1     0.0165  0.1192   0.2035   0.813   1.6702  1
2     0.1415  0.3868   0.0681   0.5755  1.0579  1
3     0.5804  0.3331   0.081    0.5755  1.0579  1
4     0.2304  0.296    0.1225   0.4102  3.0809  1
5     0.3684  0.3913   0.0524   0.1658  1.1533  1
6     0.1527  0.3344   0.0783   0.7736  1.5046  1
7     0.1126  0.3071   0.0839   1.3429  1.5736  1
8     0.0141  0.2366   0.0905   0.5863  1.4651  1
9     0.222   0.1797   0.1526   0.3459  1.7237  1
10    0.2776  0.2567   0.1642   0.2968  1.8904  1
11    0.2689  0.1729   0.0287   0.1224  0.9277  0
12    0.2039  -0.0476  0.1263   0.8965  1.0457  0
13    0.5056  -0.1951  0.2026   0.538   1.9514  0
14    0.1759  0.1343   0.0946   0.1955  1.9218  0
15    0.3579  0.1515   0.0812   0.1991  1.4582  0
16    0.2845  0.2038   0.0171   0.3357  1.3258  0
17    0.1209  0.2823   -0.0113  0.3157  2.3219  0
18    0.1254  0.1956   0.0079   0.2073  1.489   0
19    0.1777  0.0891   0.0695   0.1924  1.6871  0
20    0.2409  0.166    0.0746   0.2516  1.8524  0
Bankruptcy Example
• Using the BankruptTrain.xls data, create a SAS data file called bankrupt
  • BR_NB: role is target and level is binary
  • Firm: role is rejected and level is nominal (it is simply the firm number)
  • Remaining five financial ratio variables: role is input and level is interval
Create a diagram named bankrupt1. Drag and drop the data node onto the model field. Highlight the data node and, on the left-hand panel, click on the box to the right of Variables to see the variables.
From the Explore tab, drag and drop the StatExplore node onto the diagram and link it to the bankrupt node. Highlight the StatExplore node, right-click and run it, and obtain the results. On top, correlations between the five input variables and the target are shown via bars ordered from largest to smallest. Below, the mean variable score for bankrupt vs. non-bankrupt observations is shown.
From the Model tab, drag and drop the Regression node onto the diagram and connect it to the bankrupt node. Highlight the Regression node, run it, and obtain the results.
The results show that the model fits the data very well, with a highly significant overall chi-square statistic, low error values, and 0 misclassifications. Cumulative lift shows that the top 50% of observations, ranked by predicted probability, are twice as likely to be bankrupt as the data set overall (a lift of 2).
Scoring
• Once you have specified a model, you might wish to apply it to new data whose outcome is unknown, that is, to make predictions
• This can be easily accomplished in Enterprise Miner using scoring
• Convert the data set BankruptScore.xls to a SAS file called bankruptscore. The role of this data is score.
Bankruptcy Scoring Data
Firm  WC/TA   RE/TA   EBIT/TA  MVE/TD  S/TA
A     0.1759  0.1343  0.0956   0.1955  1.9218
B     0.3732  0.3483  -0.0013  0.3483  1.8223
C     0.1725  0.3238  0.104    0.8847  0.5576
D     0.163   0.3555  0.011    0.373   2.8307
E     0.1904  0.2011  0.1329   0.558   1.6623
F     0.1123  0.2288  0.01     0.1884  2.7186
G     0.0732  0.3526  0.0587   0.2349  1.7432
H     0.2653  0.2683  0.0235   0.5118  1.835
I     0.107   0.0787  0.0433   0.1083  1.2051
J     0.2921  0.239   0.9673   0.3402  0.9277
Drag and drop the bankruptscore data node onto the bankrupt1 diagram. From the Assess tab, drag and drop the Score node into the diagram. Link the Regression and bankruptscore nodes together and connect them to the Score node.
Run the Score node and obtain the results. Of the 10 firms, 6 are predicted to become bankrupt.
For details about the individual predictions, highlight the Score node and, on the left-hand panel, click on the square to the right of Exported Data. Then, in the box that appears, click on the row whose Port entry is Score. Then click Explore.
The lower portion of the output is shown below. The predictions are given, along with the probabilities of the firm becoming bankrupt or not.
Regression Using Selection Models
• When there are a number of possible input variables, procedures are available to sort through them and include those that have a certain level of statistical significance
• SAS Enterprise Miner 5.3 offers three selection methods:
  • Backward
  • Forward
  • Stepwise
Regression Using Selection Models
• Backward: training begins with all candidate effects in the model and removes effects until the stay significance level or the stop criterion is met
• Forward: training begins with no candidate effects in the model and adds effects until the entry significance level or the stop criterion is met
• Stepwise: training begins as in the forward method but may remove effects already in the model. This continues until the stay significance level or the stop criterion is met
Note that the default significance levels (p-values) are 0.05 and that no stop criteria (such as a maximum number of steps in the regression) are set.
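The control flow of stepwise selection can be sketched in a few lines. The code below is only an illustration of the add/remove logic described above, not Enterprise Miner's implementation: `p_value` is a hypothetical callback that would return the significance of an effect within a given model, the toy p-values are made up, and the step cap is a safety assumption added so the sketch always terminates.

```python
from typing import Callable, FrozenSet, List, Optional, Sequence

def stepwise_select(
    candidates: Sequence[str],
    p_value: Callable[[str, FrozenSet[str]], float],
    entry_level: float = 0.05,
    stay_level: float = 0.05,
    max_steps: Optional[int] = None,
) -> List[str]:
    """Add the most significant effect, then drop effects that no longer qualify."""
    selected: List[str] = []
    # Safety cap (an assumption of this sketch, not an Enterprise Miner default)
    limit = max_steps if max_steps is not None else 2 * len(candidates)
    for _ in range(limit):
        remaining = [c for c in candidates if c not in selected]
        if not remaining:
            break
        # Forward step: find the candidate with the smallest p-value
        best_p, best = min(
            (p_value(c, frozenset(selected + [c])), c) for c in remaining
        )
        if best_p >= entry_level:
            break  # nothing meets the entry significance level
        selected.append(best)
        # Backward step: remove earlier effects that fail the stay level
        for effect in [e for e in selected if e != best]:
            if p_value(effect, frozenset(selected)) >= stay_level:
                selected.remove(effect)
    return selected

# Made-up p-values for illustration; a real run would refit the model each time
toy_p = {"WC_TA": 0.40, "RE_TA": 0.01, "EBIT_TA": 0.20, "MVE_TD": 0.30, "S_TA": 0.60}

def toy_p_value(effect: str, model: FrozenSet[str]) -> float:
    return toy_p[effect]

chosen = stepwise_select(list(toy_p), toy_p_value)
print(chosen)
```

Dropping the backward step gives forward selection; starting with every candidate in the model and running only the removal step gives backward selection.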
Regression Using Selection Models – Bankruptcy Model
To select stepwise regression for the bankruptcy model, highlight the regression node and in the properties panel under Selection Model choose Stepwise. The default significance level of 0.05 is used.
Regression Using Selection Models – Bankruptcy Model
• Interestingly, the trained model uses only RE/TA as a predictor
• There are 3 misclassifications (a .15 rate) in this set vs. 0 in the original model
• The results are very different: the original model with all 5 input variables predicted bankruptcy for G, E, C, and J, while the stepwise model predicted that B, C, D, F, G, H, and J would become bankrupt
• Changing the significance levels to 0.1 (to make it easier for input variables to enter or leave the stepwise model) produces the same results

Extend sim 01Extend sim 01
Extend sim 01
 
Training backed by experience
Training backed by experienceTraining backed by experience
Training backed by experience
 
SEAS Space Surveillance Study
SEAS Space Surveillance StudySEAS Space Surveillance Study
SEAS Space Surveillance Study
 
Predictive Modeling and Analytics select_chapters
Predictive Modeling and Analytics select_chaptersPredictive Modeling and Analytics select_chapters
Predictive Modeling and Analytics select_chapters
 
I/ITSEC2009 Best Tutorial
I/ITSEC2009 Best TutorialI/ITSEC2009 Best Tutorial
I/ITSEC2009 Best Tutorial
 

Predictive Modeling with Enterprise Miner

  • 1. Copyright © 2011 Clarity Solution Group Predictive Modeling with Enterprise Miner Jeffrey Strickland, Ph.D. Senior Consultant
  • 2. Copyright © 2012 Clarity Solution Group Learning Objectives • To understand the application of regression analysis in data mining: linear/nonlinear and logistic (logit) • To understand the key statistical measures of fit • To learn how to run and interpret regression analyses using SAS Enterprise Miner software
  • 3. SAS Enterprise Miner • These results can be obtained using Excel or a data mining package such as SAS Enterprise Miner 5.3 • Using SAS Enterprise Miner requires the following steps: (1) convert your data (usually in an Excel file) into a SAS data file using SAS 9.1; (2) create a project in Enterprise Miner; (3) within the project, create a data source using your SAS data file, create a diagram that includes a data node, a regression node, and a Multiplot node for graphs, then run the model in the diagram and review the results
  • 4. Creating a SAS data file from an Excel file: open SAS 9.1. Select File, then Import Data.
  • 5. This opens the Import Wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file.
  • 6. Since the data are on sheet1$, click Next. Then enter SASUSER as the Library and TEMPKILOWATTL as the Member. Then click Next.
  • 7. Now click Finish to create your file.
  • 8. Open SAS Enterprise Miner 5.3. Enter the user name and password provided.
  • 9. The Enterprise Miner window opens. Select New Project.
  • 10. The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g., KWattTemp0. Keep the default path.
  • 11. In the Startup Code tab, enter: libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects"; This code will be run each time you open the project.
  • 12. The Enterprise Miner application window opens.
  • 13. Right-click on Data Sources, which opens the wizard. The source is a SAS table, so click Next.
  • 14. Browse the SAS libraries to find the SAS table Tempkilowattl in the Sasuser library (created previously).
  • 15. Click Next twice. Note that Table Properties shows that we have two variables with 12 observations.
  • 16. The next step controls how Enterprise Miner organizes metadata for the variables in your data. Select Advanced, then click Next (you can view or change the settings if you click Customize before clicking Next).
  • 17. Change the Role of KWatts to Target (the outcome variable); change the Level of both KWatts and Temp to Interval (continuous values); then click Next. (Other levels, such as Binary, are possible.) You can click Explore if you wish to look at some basic statistics; we will do this later.
  • 18. Here Role refers to the role of the data set (raw, train, validate, score); raw is fine for our analysis, so click Finish.
  • 19. Tempkilowattl now appears under Data Sources in the top-left panel, called the Project Panel.
  • 20. We need to create a Diagram for our model. Right-click on Diagrams, then enter TempKwatts0 in the dialog box. Now the left panel shows TempKwatts0 as a Diagram, and the right-hand panel is called the Diagram Workspace. Icons can be dragged and dropped onto the Diagram Workspace.
  • 21. Now add an Input Data node to the diagram. From the Data Sources list in the Project Panel, drag and drop the data source TempKwatts0 onto the Diagram Workspace. Note that when the Input Data node is highlighted, its properties are displayed in the left-hand panel.
  • 22. If you wish to see the properties of any or all of the variables, highlight the Input Data node; then, in the left-hand Properties Panel under Train, click on the box to the right of Variables; in the screen that opens, control-click on KWatts and Temp; then click on Explore in the lower right.
  • 23. Frequency distributions for the variables and the raw data are provided. Right-clicking on observations in the lower-left panel will show where they appear in the bar charts. Click Cancel when finished.
  • 24. Click on the Explore tab found over the Diagram Workspace, and then drag and drop the Multiplot icon onto the workspace. Using your cursor, draw a directed arrow from the TempKwattsl icon to the Multiplot icon. With the Multiplot icon highlighted, its properties are found in the left-hand Properties Panel.
  • 25. Right-click on the Multiplot icon and select Run. After the run is completed, select Results from the Run Status window.
  • 26. Various charts are available. Descriptive statistics for each variable are given in the lower pane.
  • 27. Click on the Model tab and drag the Regression icon onto the Model field. Connect the Tempkwattsl icon to the Regression icon. Highlight the Regression icon and, on the Properties Panel, change Regression Type to Linear Regression.
  • 28. Run the Regression node and select Results. Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, the model output (estimates, F and t statistics, R-square), and the two effects (intercept and slope; bar height represents size and color represents direction).
  • 29. For a given percentile, the Target Mean is the actual value (or an estimate based on actuals), that is, what you are trying to predict; the Mean for Predicted is the forecasted value, or prediction (or an estimate based on forecasts). The results are shown from highest to lowest forecasted values. The distance between the curves shows how well the model predicts the actual data.
  • 30. A variety of fit statistics are provided. These include SSE, MSE = SSE/(n-2), ASE = SSE/n, RMSE = SQRT(MSE), RASE = SQRT(ASE), FPE = MSE(n+p+1)/n, and MAX = the largest error in absolute value, where n = number of observations and p = number of variables in the model (one in our case). Schwarz's Bayesian Criterion and Akaike's Information Criterion are used for model selection (comparing one model to another). Schwarz's criterion adjusts the residual squared error for the number of parameters estimated, while Akaike's is a relative measure of the information lost from fitting the model.
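The fit statistics on this slide can be reproduced by hand. The sketch below is illustrative only: the actual and predicted values are hypothetical (the TempKWatts data are not listed in the slides), the function name is made up, MSE = SSE/(n-p-1) is assumed as the general form of the slide's SSE/(n-2), and RASE is taken as the square root of ASE.

```python
import math

def fit_statistics(actual, predicted, p=1):
    """Fit statistics as defined on the slide (n observations, p input variables)."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, predicted)]
    sse = sum(e * e for e in errors)
    mse = sse / (n - p - 1)          # assumed general form; equals SSE/(n-2) when p = 1
    ase = sse / n
    return {
        "SSE": sse,
        "MSE": mse,
        "ASE": ase,
        "RMSE": math.sqrt(mse),
        "RASE": math.sqrt(ase),
        "FPE": mse * (n + p + 1) / n,
        "MAX": max(abs(e) for e in errors),
    }

# Hypothetical actual and predicted values, for illustration only.
actual = [10.0, 12.0, 15.0, 11.0]
predicted = [11.0, 12.0, 13.0, 12.0]
stats = fit_statistics(actual, predicted)
print(stats)
```

Because MSE divides by the residual degrees of freedom while ASE divides by n, RMSE is always at least as large as RASE on the same data.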
  • 31. Kwatts vs. Temp Example 2 • Another approach to modeling the relationship between Kwatts and Temp is to use a nonlinear regression • This is easily accomplished in Enterprise Miner: highlight the regression node, then in the left-hand panel select Yes for Polynomial Terms • We use the default of two terms • Is the fit any better?
  • 33. Multiple Regression. Consider the following data relating family size and income to food expenditures:
      family   food $   income $   family size
      1        5.2      28         3
      2        5.1      26         3
      3        5.6      32         2
      4        4.6      24         1
      5        11.3     54         4
      6        8.1      59         2
      7        7.8      44         3
      8        5.8      30         2
      9        5.1      40         1
      10       18       82         6
      11       4.9      42         3
      12       11.8     58         4
      13       5.2      28         1
      14       4.8      20         5
      15       7.9      42         3
      16       6.4      47         1
      17       20       112        6
      18       13.7     85         5
      19       5.1      31         2
      20       2.9      26         2
  • 34. Multiple Regression • We can run this problem in Enterprise Miner using the same approach followed in the previous example • On our model field we have placed the data source called foodexpenditures, along with both Multiplot and StatExplore, found under the Explore tab above the model field • Highlight foodexpenditures, then in the left-hand panel under Train, find Variables and click on the box to the right to open the variables • Change the role of family to rejected (it is just the number of the observation), change the role of food_ to target, and set the level of income_, food_, and fam_size to interval, then click OK
  • 35. Foodexpenditures Model
  • 36. Highlight the StatExplore node, right-click to Run, then select Results. Correlations between the input variables and the target are provided, along with basic statistics. The input variables are ordered by the size of the correlations. Now close the results window, run the Regression node, and obtain results.
  • 37. Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, the model output (estimates, F and t statistics, R-square), and the three effects (intercept and slopes for the two input variables; bar height represents size and color represents direction). The model is significant and fits the data well.
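For readers without Enterprise Miner, the multiple regression of food expenditure on income and family size can be sketched with ordinary least squares via the normal equations. This is a minimal illustration, not the Regression node's implementation: the data rows are an assumed reading of the flattened table on slide 33, and the helper names `solve` and `fit_ols` are made up for this example.

```python
# (food $, income $, family size) for the 20 families, as read from slide 33 (assumed).
rows = [
    (5.2, 28, 3), (5.1, 26, 3), (5.6, 32, 2), (4.6, 24, 1), (11.3, 54, 4),
    (8.1, 59, 2), (7.8, 44, 3), (5.8, 30, 2), (5.1, 40, 1), (18.0, 82, 6),
    (4.9, 42, 3), (11.8, 58, 4), (5.2, 28, 1), (4.8, 20, 5), (7.9, 42, 3),
    (6.4, 47, 1), (20.0, 112, 6), (13.7, 85, 5), (5.1, 31, 2), (2.9, 26, 2),
]

def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def fit_ols(data):
    """Return (coefficients, R-square, residuals) for food ~ intercept + income + size."""
    X = [(1.0, income, size) for _, income, size in data]
    y = [food for food, _, _ in data]
    xtx = [[sum(xi[i] * xi[j] for xi in X) for j in range(3)] for i in range(3)]
    xty = [sum(xi[i] * yi for xi, yi in zip(X, y)) for i in range(3)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, xi)) for xi in X]
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - sse / sst, [yi - fi for yi, fi in zip(y, fitted)]

beta, r_squared, residuals = fit_ols(rows)
print("intercept, income slope, family-size slope:", beta)
print("R-square:", r_squared)
```

The three coefficients correspond to the three effects shown in the results window (intercept plus one slope per input variable).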
  • 38. What happens in regression analysis when the target variable is binary? • There are many situations in which the target variable is binary. Some examples: • whether a customer will or will not receive credit • whether a customer will or will not respond to a promotion • whether a firm will go bankrupt in a year • whether a student will pass an exam
  • 39. Passing an Exam Data
      Student id   Outcome   Study Hours
      1            0         3
      2            1         34
      3            0         17
      4            0         6
      5            0         12
      6            1         15
      7            1         26
      8            1         29
      9            0         14
      10           1         58
      11           0         2
      12           1         31
      13           1         26
      14           0         11
  • 40. Running a linear regression to predict pass/don’t pass as a function of hours of study yields a model that fits the data poorly. The data are given in exampassing.xls. [Chart: Passing an Exam; x-axis: hours of study; y-axis: pass or don't pass; series: Actual, Predicted]
  • 41. The Enterprise Miner results show a poor fit on a percentile basis between predicted and target; another modeling approach is needed.
  • 42. Logistic Regression • Similar to linear regression, with two main differences • Y (outcome or response) is categorical (Yes/No, Approve/Reject, Responded/Did not respond) • The result is expressed as a probability of being in either group.
  • 43. Comparing the Logistic & Linear Regression Models
  • 44. Logistic regression: p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)]; 1-p = 1/[1+exp(a+bx)]; ln[p/(1-p)] = a + bx, where: exp (or e) is the exponential function (e = 2.71828…); ln is the natural logarithm (ln(e) = 1); p is the probability that the event y occurs given x, and can range between 0 and 1; p/(1-p) is the odds (these slides call it the "odds ratio"); ln[p/(1-p)] is the log odds, or "logit"; all other components of the regression model are the same.
  • 45. Odds Ratio • Frequently used • Related to the probability of an event as follows: Odds Ratio = p/(1-p) • Example: the probability of a firm going bankrupt = .25, so the odds the firm will go bankrupt = .25/(1-.25) = 1/3, that is, 1 to 3 in favor (3 to 1 against) • This is how sports books quote odds (e.g., if the odds against VU winning a championship are 2:1, the probability of winning is 1/3) • ln[p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b, or the odds ratio increases by a factor of exp(b)
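The conversion between probability and odds on this slide is simple arithmetic. A small sketch, using exact fractions so the results are easy to check; the function names are illustrative and not part of any SAS tool.

```python
from fractions import Fraction

def odds_from_probability(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Probability of an event given the odds in its favor."""
    return odds / (1 + odds)

# Bankruptcy example: p = 1/4 gives odds of 1/3 (1 to 3 in favor, 3 to 1 against).
bankrupt_odds = odds_from_probability(Fraction(1, 4))

# Sports example: 2:1 against winning means odds of 1/2 in favor.
win_probability = probability_from_odds(Fraction(1, 2))

print(bankrupt_odds, win_probability)
```

Both calls return 1/3, one as odds and the other as a probability, which is why the two quantities are easy to confuse.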
  • 46. Probability, Odds Ratio, LN of Odds Ratio [Chart: odds and ln(odds) plotted for probabilities from 0.05 to 0.95; series: probability, odds, ln(odds)]
  • 47. Running the exam data: change Regression Type from Linear Regression to Logistic Regression. Highlight the data node; in the left-hand panel under Train, open Variables and change the level of outcome to binary.
  • 48. Results show a much better fit (upper left) and only one misclassification (lower right; a false negative).
  • 49. The results show that the odds ratio = p/(1-p) = exp(-8.4962 + 0.4949x). For every additional hour of study, the odds ratio increases by a factor of exp(0.4949) = 1.640.
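The fitted equation can be turned into predicted pass probabilities directly. The sketch below uses the intercept and slope reported on this slide; the function name and the "break-even" quantity (the study hours at which the predicted probability is 50%) are additions for illustration.

```python
import math

A, B = -8.4962, 0.4949  # intercept and slope reported on the slide

def pass_probability(hours):
    """Predicted probability of passing after the given hours of study."""
    z = A + B * hours
    return 1.0 / (1.0 + math.exp(-z))

odds_multiplier = math.exp(B)   # factor applied to the odds per extra hour of study
break_even_hours = -A / B       # hours at which the predicted probability is 0.5

print(round(odds_multiplier, 3), round(break_even_hours, 2))
for hours in (10, 15, 17, 20, 26):
    print(hours, round(pass_probability(hours), 3))
```

With a 0.5 cutoff, students who study less than the break-even point are predicted to fail, which is one way to see where a false negative can arise for a student who passed with relatively few hours.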
  • 50. Understanding Response Rate and Lift. To better understand the top-left chart, change Cumulative Lift to Cumulative % Response. The observations are ranked by their predicted probability of response from the fitted model, highest to lowest.
  • 51. Understanding Response Rate and Lift • Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40th percentile. • At the 50th percentile, the observation with the next-highest predicted probability is a non-response, so the cumulative response drops to 6/7, or 85.7%. • The 8th-ranked observation, between the 55th and 60th percentiles, is a positive response, so the cumulative % response is 7/8, or 87.5%. • Since there are no more positive responses after the 60th percentile, the cumulative response rate declines to 50% by the 100th percentile. • The chart shows how well the cumulative ranked predictions lead to a match between actual and predicted responses.
  • 52. Understanding Response Rate and Lift • Lift is the ratio of the actual response rate (passing) among the top n% of the ranked observations to the overall response rate. Cumulative lift is defined likewise. • At the 50th percentile, the cumulative % response is 85.7% (6/7) and the cumulative base response is 50%, for a lift of about 1.714.
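The cumulative % response and lift figures can be recomputed from the ranked outcomes. In this sketch the ranked list encodes the ordering described on the slides (six passes, one non-response, one pass, then six non-responses); the function name is illustrative, and depth is counted in observations rather than percentiles.

```python
def cumulative_response_and_lift(ranked_outcomes):
    """Return (depth, cumulative response rate, cumulative lift) for each depth.

    ranked_outcomes holds actual 0/1 responses, sorted by predicted
    probability from highest to lowest.
    """
    base_rate = sum(ranked_outcomes) / len(ranked_outcomes)
    rows = []
    hits = 0
    for depth, outcome in enumerate(ranked_outcomes, start=1):
        hits += outcome
        response = hits / depth
        rows.append((depth, response, response / base_rate))
    return rows

# Actual outcomes in ranked order, as described on the slides.
ranked = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
rows = cumulative_response_and_lift(ranked)

depth, response, lift = rows[6]   # 7 of 14 observations = 50th percentile
print(depth, round(response, 3), round(lift, 4))
```

At full depth the cumulative response equals the base rate, so cumulative lift always ends at 1.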
  • 53. On the Properties Panel, click on Exported Data to see the predicted probabilities and response for each observation and compare them to the actual response.
  • 54. Logistic regression uses maximum likelihood (not the sum of squared errors) to estimate the model parameters. The results show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether an individual effect is significant.
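To make the maximum likelihood step concrete, here is a minimal Newton-Raphson sketch for a one-input logistic regression. It is not Enterprise Miner's algorithm; the step-halving safeguard and all function names are choices made for this example, and the data are an assumed reading of the flattened exam table on slide 39.

```python
import math

# Exam data as read from the flattened table on slide 39 (an assumption).
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_likelihood(a, b, xs, ys):
    """Bernoulli log-likelihood: sum of y*z - log(1 + exp(z)), with z = a + b*x."""
    total = 0.0
    for x, y in zip(xs, ys):
        z = a + b * x
        log1pexp = z + math.log1p(math.exp(-z)) if z > 0 else math.log1p(math.exp(z))
        total += y * z - log1pexp
    return total

def fit_logistic(xs, ys, iterations=50, tol=1e-10):
    """Maximize the log-likelihood by Newton-Raphson with step halving."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient with respect to the intercept
            g1 += (y - p) * x      # gradient with respect to the slope
            h00 += w               # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        step = 1.0
        current = log_likelihood(a, b, xs, ys)
        # Halve the step if the full Newton step would lower the likelihood.
        while log_likelihood(a + step * da, b + step * db, xs, ys) < current and step > 1e-8:
            step /= 2.0
        a, b = a + step * da, b + step * db
        if abs(step * da) < tol and abs(step * db) < tol:
            break
    return a, b

a_hat, b_hat = fit_logistic(hours, passed)
misclassified = sum(
    1 for x, y in zip(hours, passed) if (sigmoid(a_hat + b_hat * x) >= 0.5) != (y == 1)
)
print(a_hat, b_hat, misclassified)
```

The estimates can be compared with the intercept and slope reported on slide 49, and the misclassification count (at a 0.5 cutoff) with the single false negative noted on slide 48.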
  • 55. Bankruptcy Prediction • To predict bankruptcy a year in advance, you might collect: • working capital/total assets (WC/TA) • retained earnings/total assets (RE/TA) • earnings before interest and taxes/total assets (EBIT/TA) • market value of equity/total debt (MVE/TD) • sales/total assets (S/TA)
  • 56. Bankruptcy Training Data
      Firm   WC/TA    RE/TA     EBIT/TA   MVE/TD   S/TA     BR/NB
      1      0.0165   0.1192    0.2035    0.813    1.6702   1
      2      0.1415   0.3868    0.0681    0.5755   1.0579   1
      3      0.5804   0.3331    0.081     0.5755   1.0579   1
      4      0.2304   0.296     0.1225    0.4102   3.0809   1
      5      0.3684   0.3913    0.0524    0.1658   1.1533   1
      6      0.1527   0.3344    0.0783    0.7736   1.5046   1
      7      0.1126   0.3071    0.0839    1.3429   1.5736   1
      8      0.0141   0.2366    0.0905    0.5863   1.4651   1
      9      0.222    0.1797    0.1526    0.3459   1.7237   1
      10     0.2776   0.2567    0.1642    0.2968   1.8904   1
      11     0.2689   0.1729    0.0287    0.1224   0.9277   0
      12     0.2039   -0.0476   0.1263    0.8965   1.0457   0
      13     0.5056   -0.1951   0.2026    0.538    1.9514   0
      14     0.1759   0.1343    0.0946    0.1955   1.9218   0
      15     0.3579   0.1515    0.0812    0.1991   1.4582   0
      16     0.2845   0.2038    0.0171    0.3357   1.3258   0
      17     0.1209   0.2823    -0.0113   0.3157   2.3219   0
      18     0.1254   0.1956    0.0079    0.2073   1.489    0
      19     0.1777   0.0891    0.0695    0.1924   1.6871   0
      20     0.2409   0.166     0.0746    0.2516   1.8524   0
  • 57. Bankruptcy Example • Using the BankruptTrain.xls data, create a SAS data file called bankrupt • BR_NB: role is target and level is binary • Firm: role is rejected and level is nominal (it is simply the firm number) • Remaining five financial ratio variables: role is input and level is interval
  • 58. Create a diagram named bankrupt1. Drag and drop the data node onto the diagram. Highlight the data node and, in the left-hand panel, click on the box to the right of Variables to see the variables.
  • 59. From the Explore tab, drag and drop the StatExplore node onto the diagram and link it to the bankrupt node. Highlight the StatExplore node, right-click and run it, and obtain results. On top, correlations between the five input variables and the target are shown as bars ordered from largest to smallest. Below, the mean of each variable for bankrupt vs. non-bankrupt observations is shown.
  • 60. From the Model tab, drag and drop the Regression node onto the diagram and connect it to the bankrupt node. Highlight the Regression node, run it, and obtain the results.
  • 61. The results show that the model fits the data very well, with a highly significant overall chi-square statistic, low error values, and 0 misclassifications. Cumulative lift shows that the top 50% of ranked observations are twice as likely to be bankrupt as the data set overall.
  • 62. Scoring • Once you have specified a model, you might wish to apply it to new data whose outcome is unknown, that is, to make predictions • This can be easily accomplished in Enterprise Miner using scoring • Convert the data set BankruptScore.xls to a SAS file called bankruptscore. The role of this data is score.
  • 63. Bankruptcy Scoring Data
      Firm   WC/TA    RE/TA    EBIT/TA   MVE/TD   S/TA
      A      0.1759   0.1343   0.0956    0.1955   1.9218
      B      0.3732   0.3483   -0.0013   0.3483   1.8223
      C      0.1725   0.3238   0.104     0.8847   0.5576
      D      0.163    0.3555   0.011     0.373    2.8307
      E      0.1904   0.2011   0.1329    0.558    1.6623
      F      0.1123   0.2288   0.01      0.1884   2.7186
      G      0.0732   0.3526   0.0587    0.2349   1.7432
      H      0.2653   0.2683   0.0235    0.5118   1.835
      I      0.107    0.0787   0.0433    0.1083   1.2051
      J      0.2921   0.239    0.9673    0.3402   0.9277
  • 64. Drag and drop the bankruptscore data node onto the bankrupt1 diagram. From the Assess tab, drag and drop the Score node into the diagram. Connect the Regression and bankruptscore nodes to the Score node.
  • 65. Run the Score node and obtain the Results. Of the 10 firms, 6 are predicted to become bankrupt.
  • 66. For details about the individual predictions, highlight the Score node and, in the left-hand panel, click on the square to the right of Exported Data. Then, in the box that appears, click on the row whose Port entry is Score. Then click on Explore.
  • 67. The lower portion of the output gives the predictions, along with the probabilities of each firm becoming bankrupt or not.
  • 68. Regression Using Selection Models • When there are a number of possible input variables, procedures are available to sort through them and include those that reach a certain level of statistical significance • SAS Enterprise Miner 5.3 offers three selection methods: • Backward • Forward • Stepwise
  • 69. Regression Using Selection Models • Backward: training begins with all candidate effects in the model and removes effects until the stay significance level or the stop criterion is met • Forward: training begins with no candidate effects in the model and adds effects until the entry significance level or the stop criterion is met • Stepwise: training begins as in the forward model but may remove effects already in the model. This continues until the stay significance level or the stop criterion is met. Note that the default significance levels (p-values) are 0.05 and no stop criteria (such as a maximum number of steps in the regression) are set.
  • 70. Regression Using Selection Models: Bankruptcy Model. To select stepwise regression for the bankruptcy model, highlight the Regression node and, in the Properties Panel under Selection Model, choose Stepwise. The default significance level of 0.05 is used.
  • 71. Regression Using Selection Models: Bankruptcy Model • Interestingly, the stepwise training model uses only RE/TA as a predictor • There are 3 misclassifications (a .15 rate) in the training set vs. 0 in the original model • The scoring results are very different: the original model with all 5 input variables predicted bankruptcy for G, E, C, and J, while the stepwise model predicted that B, C, D, F, G, H, and J would become bankrupt • Changing the significance levels to 0.1 (to make it easier for input variables to enter or leave the stepwise model) produces the same results