Predictive Modeling with Enterprise Miner

Copyright © 2011 Clarity Solution Group
Jeffrey Strickland, Ph.D.
Senior Consultant

Learning Objectives
•To understand the application of regression analysis in data mining
•Linear/nonlinear
•Logistic (Logit)
•To understand the key statistical measures of fit
•To learn how to run and interpret regression analyses using SAS Enterprise Miner software

SAS Enterprise Miner
•These results can be obtained using Excel or using a data mining package such as SAS Enterprise Miner 5.3
•Using SAS Enterprise Miner requires the following steps:
•Convert your data (usually in an Excel file) into a SAS data file using SAS 9.1
•Create a project in Enterprise Miner
•Within the project:
•Create a data source using your SAS data file
•Create a diagram that includes a data node and a regression node and a multiplot node for graphs
•Run the model in the diagram and review the results

Creating a SAS data file from an Excel file: open SAS 9.1. Select File, then Import Data.

This opens the Import Wizard. Since the source file is from Excel, click Next. Then click Browse to find the TempKWatts.xls file.

Since the data are on sheet1$, click Next. Then enter SASUSER as the Library and TEMPKILOWATTL as the Member. Then click Next.

Now click Finish to create your file

Open SAS Enterprise Miner 5.3. Enter the user name and password provided

The Enterprise Window below opens. Select New Project

The Create New Project dialog box appears. Select the General tab, then type the short name of the project, e.g., KWattTemp0. Keep the default path.

In the Startup Code tab, enter: libname Ktemps "C:\Documents and Settings\mliberat\My Documents\My SAS Files\9.1\EM_Projects"; This code will be run each time you open the project.

The Enterprise Miner application window opens

Right-click on Data Source to open the wizard. The source is a SAS table, so click Next.

Browse the SAS libraries to find the SAS table Tempkilowattl in the SASUSER library (previously created).

Click Next twice. Note that the Table Properties show that we have two variables with 12 observations.

The next step controls how Enterprise Miner organizes metadata for the variables in your data. Select Advanced, then click Next (you can view or change the settings if you click Customize before clicking Next).

Change the Role of KWatts to Target (the outcome variable); change the Level of both KWatts and Temp to Interval (continuous values); then click Next. (Other levels, such as Binary, are possible.) You can click on Explore if you wish to look at some basic statistics; we will do this later.

Here Role relates to the role of the data set (raw, train, validate, score); raw is fine for our analysis of data, so click Finish

Tempkilowattl now appears under Data Sources in the top-left panel, called the Project Panel.

We need to create a Diagram for our model. Right-click on Diagrams, then enter TempKwatts0 in the dialog box. Now the left panel shows TempKwatts0 as a Diagram, and the right-hand panel is called the Diagram Workspace. Icons can be dragged and dropped onto the Diagram Workspace.

Now add an Input Data node to the Diagram. From the Data Sources list in the Project Panel, drag and drop the Data Source TempKwatts0 onto the Diagram Workspace. Note that when the input data node is highlighted, various properties are displayed on the left-hand panel.

If you wish to see the properties of any or all of the variables, highlight the input data node; then on the left-hand Properties Panel under Train, click on the box to the right of Variables; in the screen that opens, control-click on KWatts and Temp; then click on Explore in the lower right.

Frequency distributions for the variables and the raw data are provided. Right-clicking on observations in the lower-left panel will show where they appear in the bar charts. Cancel when finished.

Click on the Explore tab found over the Diagram Workspace, and then drag and drop the Multiplot icon onto the field. Using your cursor, draw a directed arrow from the TempKwattsl icon to the Multiplot icon. With the Multiplot icon highlighted, its properties are found in the left-hand Properties Panel.

Right-click on the Multiplot icon and select Run. After the run is completed, select Results from the Run Status window.

Various charts are available as shown below. Descriptive statistics for each variable are given in the lower pane.

Click on the Model tab and drag the Regression icon onto the Model field. Connect the Tempkwattsl icon to the Regression icon. Highlight the Regression icon and, on the Properties Panel, change Regression Type to Linear Regression.

Run the Regression node and select Results. Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, model output (estimates, F and t statistics, R-square), and the two effects (intercept and slope; bar length represents size and color represents direction).

For a given percentile, the Target Mean is the actual value (or an estimated value based on actuals), that is, what you are trying to predict; the Mean for Predicted is the forecasted value (or an estimated value based on forecasts). The results are shown from highest to lowest forecasted values. The distance between the curves shows how well the model predicts the actual data.

A variety of fit statistics are provided. These include SSE, MSE = SSE/(n-2), ASE = SSE/n, RMSE = SQRT(MSE), RASE = SQRT(ASE), FPE = MSE(n+p+1)/n, and MAX = largest error in absolute value, where n = number of observations and p = number of variables in the model (one in our case). Schwarz's Bayesian Criterion and Akaike's Information Criterion are used for model selection (comparing one model to another). Schwarz's criterion adjusts the residual squared error for the number of parameters estimated, while Akaike's is a relative measure of the information lost from fitting the model.
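The fit statistics listed above can be checked by hand with a few lines of Python. This is a minimal sketch, not Enterprise Miner's own code: the function name fit_statistics and the four sample values are hypothetical (the TempKWatts data are not reproduced in these slides), and writing the MSE divisor as n-p-1 is an assumption that reduces to the n-2 given above when p = 1.

```python
import math

def fit_statistics(actual, predicted, p):
    """Fit statistics for a regression with p input variables plus an intercept."""
    n = len(actual)
    errors = [a - f for a, f in zip(actual, predicted)]
    sse = sum(e * e for e in errors)      # sum of squared errors
    mse = sse / (n - p - 1)               # assumption: equals SSE/(n-2) when p = 1
    ase = sse / n                         # average squared error
    return {
        "SSE": sse,
        "MSE": mse,
        "ASE": ase,
        "RMSE": math.sqrt(mse),
        "RASE": math.sqrt(ase),
        "FPE": mse * (n + p + 1) / n,     # final prediction error
        "MAX": max(abs(e) for e in errors),
    }

# Hypothetical values for illustration only (not the TempKWatts data).
actual = [10.0, 12.0, 15.0, 19.0]
predicted = [11.0, 12.0, 14.0, 20.0]
stats = fit_statistics(actual, predicted, p=1)
for name, value in stats.items():
    print(f"{name:5s} {value:.4f}")
```

With these made-up numbers the errors are -1, 0, 1, and -1, so SSE = 3, MSE = 1.5, ASE = 0.75, and FPE = 1.5 x 6/4 = 2.25.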

Kwatts vs. Temp Example 2
•Another approach to modeling the relationship between Kwatts and Temp is to use a nonlinear regression
•This is easily accomplished in Enterprise Miner: highlight the regression node, then in the left-hand panel select Yes for Polynomial Terms
•We use the default of two terms
•Is the fit any better?

Multiple Regression
Consider the following data relating family size and income to food expenditures:
family  food $  income $  family size
1       5.2     28        3
2       5.1     26        3
3       5.6     32        2
4       4.6     24        1
5       11.3    54        4
6       8.1     59        2
7       7.8     44        3
8       5.8     30        2
9       5.1     40        1
10      18      82        6
11      4.9     42        3
12      11.8    58        4
13      5.2     28        1
14      4.8     20        5
15      7.9     42        3
16      6.4     47        1
17      20      112       6
18      13.7    85        5
19      5.1     31        2
20      2.9     26        2

Multiple Regression
•We can run this problem in Enterprise Miner using the same approach followed with the previous example
•On our model field we have placed the data source called foodexpenditures, and also both Multiplot and StatExplore, found under the Explore tab above the model field
•Highlight foodexpenditures, then in the left-hand panel under Training, find Variables and click on the box to the right to open up the variables
•Change the role of family to rejected (it is just the number of the observation), change the role of food_ to target, and set the level of income_, food_, and fam_size to interval, then click OK

Foodexpenditures Model

Highlight the StatExplore node, right-click to Run, then select Results. Correlations between the input variables and the target are provided, along with basic statistics. The input variables are ordered by the size of the correlations. Now close the results window, run the regression node, and obtain the results.

Starting from the upper left and going clockwise, these windows show the fit between target and predicted in percentile terms, the various fit statistics, model output (estimates, F and t statistics, R-square), and the three effects (intercept and slopes for the two input variables; bar length represents size and color represents direction). The model is significant and is a good fit with the data.
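For readers who want to reproduce the multiple regression outside Enterprise Miner, the sketch below fits food expenditure on income and family size by ordinary least squares, solving the normal equations directly. It is an illustration under assumptions: the rows are the 20 observations as read from the food expenditure table, and the helper name solve is hypothetical. No output values are quoted here, so compare what it prints with the Enterprise Miner output window.

```python
# (food $, income $, family size) for families 1-20, as read from the table.
rows = [
    (5.2, 28, 3), (5.1, 26, 3), (5.6, 32, 2), (4.6, 24, 1), (11.3, 54, 4),
    (8.1, 59, 2), (7.8, 44, 3), (5.8, 30, 2), (5.1, 40, 1), (18.0, 82, 6),
    (4.9, 42, 3), (11.8, 58, 4), (5.2, 28, 1), (4.8, 20, 5), (7.9, 42, 3),
    (6.4, 47, 1), (20.0, 112, 6), (13.7, 85, 5), (5.1, 31, 2), (2.9, 26, 2),
]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

# Design matrix with an intercept column, and the target vector.
X = [[1.0, float(income), float(size)] for _, income, size in rows]
y = [food for food, _, _ in rows]
k = 3

# Normal equations: (X'X) coef = X'y
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
coef = solve(xtx, xty)

fitted = [sum(c * v for c, v in zip(coef, r)) for r in X]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
y_bar = sum(y) / len(y)
sse = sum(e * e for e in residuals)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_square = 1 - sse / sst

print("intercept, income, family size:", [round(c, 4) for c in coef])
print("R-square:", round(r_square, 4))
```

The normal equations are fine for a problem this small; statistical packages typically use more numerically stable factorizations.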

What happens in regression analysis when the target variable is binary?
•There are many situations when the target variable is binary. Some examples:
•whether a customer will or will not receive credit
•whether a customer will or will not respond to a promotion
•whether a firm will go bankrupt in a year
•whether a student will pass an exam!

Passing an Exam Data
Student id  Outcome  Study Hours
1           0        3
2           1        34
3           0        17
4           0        6
5           0        12
6           1        15
7           1        26
8           1        29
9           0        14
10          1        58
11          0        2
12          1        31
13          1        26
14          0        11

Running a linear regression to predict pass/don’t pass as a function of hours of study provides a model that doesn’t correctly model the data. The data are given in exampassing.xls
[Figure: Passing an Exam. Actual and predicted "pass or don't pass" values plotted against hours of study.]

The Enterprise Miner results show a poor fit on a percentile basis between predicted and target, so another modeling approach is needed.
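One way to see why a straight line is a poor model for a 0/1 outcome is to fit it by least squares and look at its predictions at the extremes. The sketch below uses the exam data as read from the Passing an Exam Data table; the function name predict is hypothetical.

```python
# Exam data: hours of study and outcome (1 = pass, 0 = don't pass), students 1-14.
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(passed) / n

# Simple linear regression by least squares.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, passed))
sxx = sum((x - x_bar) ** 2 for x in hours)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

def predict(h):
    """Linear prediction of the pass indicator for h hours of study."""
    return intercept + slope * h

for h in (0, 20, 58):
    print(h, round(predict(h), 3))
```

The fitted line predicts a value above 1 for the student who studied 58 hours and a value below 0 at zero hours, neither of which can be read as a probability.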

Logistic Regression
•Similar to linear regression, with two main differences
•Y (outcome or response) is categorical
•Yes/No
•Approve/Reject
•Responded/Did not respond
•Result is expressed as a probability of being in either group.

Comparing the Logistic & Linear Regression Models

Logistic regression
p = Prob(y=1|x) = exp(a+bx)/[1+exp(a+bx)]
1-p = 1/[1+exp(a+bx)]
ln[p/(1-p)] = a + bx
where:
exp or e is the exponential function (e = 2.71828…)
ln is the natural logarithm (ln(e) = 1)
p is the probability that the event y occurs given x, and can range between 0 and 1
p/(1-p) is the odds of the event (called the "odds ratio" in these slides)
ln[p/(1-p)] is the log of the odds, or "logit"
all other components of the regression model are the same

Odds Ratio
•Frequently used
•Related to probability of an event as follows: Odds Ratio = p/(1-p)
•Example:
•Probability of firm going bankrupt = .25
•Odds firm will go bankrupt = .25/(1-.25) = 1/3, that is, 1 to 3 (or 3 to 1 against)
•This is how sports books calculate odds
•(e.g., if the odds against VU winning a championship are 2:1, the probability of winning is 1/3)
•ln[p/(1-p)] = a + bx means that as x increases by 1, the natural log of the odds ratio increases by b, or the odds ratio increases by a factor of exp(b)
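The conversions between probability, odds, and log odds described above are short enough to write out. This is a minimal sketch; the function names odds, logit, and probability_from_odds are illustrative choices, not part of any SAS product.

```python
import math

def odds(p):
    """Odds in favor of an event with probability p."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

def probability_from_odds(o):
    """Invert the odds: p = o / (1 + o)."""
    return o / (1 + o)

# Bankruptcy example: p = 0.25 gives odds of 1/3 (1 to 3 in favor).
print(odds(0.25))

# Sports example: 2:1 against means odds in favor of 1/2, so p = 1/3.
print(probability_from_odds(1 / 2))

# A one-unit increase in x multiplies the odds by exp(b).
b = 0.4949  # slope reported later for the exam model
print(math.exp(b))
```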

Probability, Odds Ratio, LN of Odds Ratio
[Figure: probability, odds, and ln(odds) plotted for probabilities from 0.05 to 0.95.]

Running the exam data: Change the regression type from linear regression to logistic regression. Highlight the data node; on the left-hand panel under Train, open Variables and change the level of outcome to binary.

Results show a much better fit (upper left) and only one misclassification (lower right: a false negative).

The results show that the odds ratio = p/(1-p) = exp(-8.4962 + 0.4949x). For every additional hour of study, the odds ratio increases by a factor of exp(0.4949) = 1.640.
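The fitted equation can be applied to each student to see where the single misclassification comes from. The sketch below uses the reported coefficients and the exam data as read from the Passing an Exam Data table; the 0.5 classification cutoff and the function name pass_probability are assumptions made for illustration.

```python
import math

A, B = -8.4962, 0.4949  # intercept and slope reported in the results

# (student id, outcome, study hours)
students = [
    (1, 0, 3), (2, 1, 34), (3, 0, 17), (4, 0, 6), (5, 0, 12), (6, 1, 15),
    (7, 1, 26), (8, 1, 29), (9, 0, 14), (10, 1, 58), (11, 0, 2),
    (12, 1, 31), (13, 1, 26), (14, 0, 11),
]

def pass_probability(hours):
    """p = exp(a + bx) / [1 + exp(a + bx)], written in the equivalent 1/(1+exp(-z)) form."""
    return 1 / (1 + math.exp(-(A + B * hours)))

# Assumed cutoff: classify as a pass when the predicted probability is at least 0.5.
misclassified = [
    (sid, outcome, hours)
    for sid, outcome, hours in students
    if (pass_probability(hours) >= 0.5) != (outcome == 1)
]

odds_factor = math.exp(B)

for sid, outcome, hours in students:
    print(sid, outcome, hours, round(pass_probability(hours), 4))
print("misclassified:", misclassified)
print("odds factor per extra hour:", round(odds_factor, 3))
```

Under that cutoff the only error is student 6, who passed after 15 hours of study but has a predicted probability of about 0.25, which is the false negative noted above.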

Understanding Response Rate and Lift
To better understand the top-left chart, change Cumulative Lift to Cumulative % Response. The observations are ranked by the predicted probability of response from the fitted model, from highest to lowest.

•Since the first 6 passes were correctly classified, the cumulative % response is 100% through the 40th percentile.
•At the 50th percentile, the next observation with the highest predicted probability is a non-response, so the cumulative % response drops to 6/7, or 85.7%.
•The 8th ranked observation, between the 55th and 60th percentiles, is a positive response, so the cumulative % response is 7/8, or 87.5%.
•Since there are no more positive responses after the 60th percentile, the cumulative % response declines from there to 50% (7/14) at the 100th percentile.
•The chart shows how well ranking the observations by predicted probability places the actual responses at the top of the list

•Lift is the ratio of the actual response rate (passing) among the top n% of the ranked observations to the overall response rate. Cumulative lift is defined likewise.
•At the 50th percentile, the cumulative % response is 85.7% and the cumulative base response is 50%, for a lift of 85.7/50 = 1.714.
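The cumulative % response and lift figures quoted above can be reproduced by ranking the 14 students by predicted probability and counting passes from the top down. This is a sketch under assumptions: it works observation by observation, whereas Enterprise Miner reports the chart at fixed percentile bins and may interpolate, so values between bin edges can differ slightly. The coefficients are those reported for the exam model.

```python
import math

A, B = -8.4962, 0.4949  # reported logistic regression coefficients
hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

def pass_probability(h):
    return 1 / (1 + math.exp(-(A + B * h)))

# Rank observations from highest to lowest predicted probability.
ranked = sorted(zip(hours, passed), key=lambda r: pass_probability(r[0]), reverse=True)

n = len(ranked)
base_rate = sum(passed) / n  # overall response rate: 7/14 = 50%

table = []  # (depth, percentile, cumulative % response, cumulative lift)
hits = 0
for depth, (_, outcome) in enumerate(ranked, start=1):
    hits += outcome
    cum_response = hits / depth
    table.append((depth, 100 * depth / n, 100 * cum_response, cum_response / base_rate))

for depth, pct, response, lift in table:
    print(f"{depth:2d}  {pct:5.1f}%  {response:5.1f}%  {lift:.4f}")
```

Row 7 (the 50th percentile) shows a cumulative % response of 85.7% and a lift of about 1.714, and row 8 shows 87.5%.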

On the Properties Panel, click on Exported Data to see the predicted probabilities and response for each observation and compare them to the actual response.

Logistic regression uses maximum likelihood (and not sum of squared errors) to estimate the model parameters. The results below show that the model is highly significant based on a chi-square test. The Wald chi-square statistic tests whether an effect is significant or not.
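To make the maximum likelihood idea concrete, the sketch below fits the one-variable exam model by Newton-Raphson with step halving, maximizing the log-likelihood directly. It is an illustration only: the optimizer and convergence settings that Enterprise Miner actually uses are not stated in these slides, and the function names are hypothetical. The data are the exam observations as read from the Passing an Exam Data table.

```python
import math

hours = [3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11]
passed = [0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1 + ez)

def log_likelihood(a, b):
    """Sum over observations of y*z - ln(1 + exp(z)), with z = a + b*x."""
    total = 0.0
    for x, y in zip(hours, passed):
        z = a + b * x
        total += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    return total

def fit_logistic(max_iter=100, tol=1e-10):
    a = b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(hours, passed):
            p = sigmoid(a + b * x)
            w = p * (1 - p)
            g0 += y - p            # gradient with respect to a
            g1 += (y - p) * x      # gradient with respect to b
            h00 += w               # entries of X'WX (negative Hessian)
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        # Halve the step until the log-likelihood does not decrease.
        step, current = 1.0, log_likelihood(a, b)
        while log_likelihood(a + step * da, b + step * db) < current and step > 1e-8:
            step /= 2
        a += step * da
        b += step * db
        if abs(step * da) < tol and abs(step * db) < tol:
            break
    return a, b

a_hat, b_hat = fit_logistic()
print("intercept:", round(a_hat, 4), "slope:", round(b_hat, 4))
print("log-likelihood:", round(log_likelihood(a_hat, b_hat), 4))
```

The estimates should land close to the intercept and slope reported for the exam model (-8.4962 and 0.4949), up to the software's convergence tolerance.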

Bankruptcy Prediction
•To predict bankruptcy a year in advance, you might collect:
•working capital/total assets (WC/TA)
•retained earnings/total assets (RE/TA)
•earnings before interest and taxes/total assets (EBIT/TA)
•market value of equity/total debt (MVE/TD)
•sales/total assets (S/TA)

Bankruptcy Training Data
Firm  WC/TA   RE/TA    EBIT/TA  MVE/TD  S/TA    BR/NB
1     0.0165   0.1192   0.2035  0.813   1.6702  1
2     0.1415   0.3868   0.0681  0.5755  1.0579  1
3     0.5804   0.3331   0.081   0.5755  1.0579  1
4     0.2304   0.296    0.1225  0.4102  3.0809  1
5     0.3684   0.3913   0.0524  0.1658  1.1533  1
6     0.1527   0.3344   0.0783  0.7736  1.5046  1
7     0.1126   0.3071   0.0839  1.3429  1.5736  1
8     0.0141   0.2366   0.0905  0.5863  1.4651  1
9     0.222    0.1797   0.1526  0.3459  1.7237  1
10    0.2776   0.2567   0.1642  0.2968  1.8904  1
11    0.2689   0.1729   0.0287  0.1224  0.9277  0
12    0.2039  -0.0476   0.1263  0.8965  1.0457  0
13    0.5056  -0.1951   0.2026  0.538   1.9514  0
14    0.1759   0.1343   0.0946  0.1955  1.9218  0
15    0.3579   0.1515   0.0812  0.1991  1.4582  0
16    0.2845   0.2038   0.0171  0.3357  1.3258  0
17    0.1209   0.2823  -0.0113  0.3157  2.3219  0
18    0.1254   0.1956   0.0079  0.2073  1.489   0
19    0.1777   0.0891   0.0695  0.1924  1.6871  0
20    0.2409   0.166    0.0746  0.2516  1.8524  0

Bankruptcy Example
•Using the BankruptTrain.xls data, create a SAS data file called bankrupt
•BR_NB: role is target and level is binary
•Firm: role is rejected and level is nominal (it is simply the firm number)
•Remaining five financial ratio variables: role is input and level is interval

Create a diagram named bankrupt1. Drag and drop the data node onto the diagram. Highlight the data node and, on the left-hand panel, click on the box to the right of Variables to see the variables.

From the Explore tab, drag and drop the StatExplore node onto the diagram and link it to the bankrupt node. Highlight the StatExplore node, right-click and run it, and obtain the results. On top, correlations between the five input variables and the target are shown via bars ordered from largest to smallest. Below, the mean of each variable for bankrupt vs. non-bankrupt observations is shown.

From the Model tab, drag and drop the regression node onto the diagram and connect it to the bankrupt node. Highlight the regression node, run it, and obtain the results.

The results show that the model fits the data very well, with a highly significant overall chi-square statistic, low error values, and 0 misclassifications. Cumulative lift shows that firms ranked in the top 50% by predicted probability are twice as likely to be bankrupt as a firm drawn at random from the data (a lift of 2).

Scoring
•Once you have specified a model, you might wish to apply it to new data whose outcome is unknown, that is, to make predictions
•This can be easily accomplished in Enterprise Miner using scoring
•Convert the data set BankruptScore.xls to a SAS file called bankruptscore. The role of this data is score.

Bankruptcy Scoring Data
Firm  WC/TA   RE/TA   EBIT/TA  MVE/TD  S/TA
A     0.1759  0.1343   0.0956  0.1955  1.9218
B     0.3732  0.3483  -0.0013  0.3483  1.8223
C     0.1725  0.3238   0.104   0.8847  0.5576
D     0.163   0.3555   0.011   0.373   2.8307
E     0.1904  0.2011   0.1329  0.558   1.6623
F     0.1123  0.2288   0.01    0.1884  2.7186
G     0.0732  0.3526   0.0587  0.2349  1.7432
H     0.2653  0.2683   0.0235  0.5118  1.835
I     0.107   0.0787   0.0433  0.1083  1.2051
J     0.2921  0.239    0.9673  0.3402  0.9277

Drag and drop the bankruptscore data node onto the bankrupt1 diagram. From the Assess tab, drag and drop the Score node into the diagram. Connect both the regression node and the bankruptscore node to the Score node.

Run the Score node and obtain the Results. Of the 10 firms, 6 are predicted to become bankrupt.

For details about the individual predictions, highlight the Score node and, on the left-hand panel, click on the square to the right of Exported Data. Then, in the box that appears, click on the row whose Port entry is Score. Then click on Explore.

The lower portion of the output is shown below. The predictions are given, along with the probabilities of the firm becoming bankrupt or not.

Regression Using Selection Models
•When there are a number of possible input variables, procedures are available to sort through them and include those that have a certain level of statistical significance
•SAS Enterprise Miner 5.3 offers three selection methods:
•Backward
•Forward
•Stepwise

Regression Using Selection Models
•Backward: training begins with all candidate effects in the model and removes effects until the stay significance level or the stop criterion is met
•Forward: training begins with no candidate effects in the model and adds effects until the entry significance level or the stop criterion is met
•Stepwise: training begins as in the forward model but may remove effects already in the model. This continues until the stay significance level or the stop criterion is met
Note that the default significance levels (p-values) are 0.05, and no stop criteria (such as a maximum number of steps in the regression) are set

Regression Using Selection Models –Bankruptcy Model
To select stepwise regression for the bankruptcy model, highlight the regression node and, in the Properties Panel under Selection Model, choose Stepwise. The default significance level of 0.05 is used.

Regression Using Selection Models –Bankruptcy Model
•Interestingly, the Training Model only uses RE/TA as a predictor
•There are 3 misclassifications (.15 rate) in this set vs. 0 in the original model
•The results are very different: the original model with all 5 input variables predicted bankruptcy for G, E, C, and J, while the stepwise model predicted B, C, D, F, G, H, and J would become bankrupt.
•Changing the significance levels to 0.1 (to make it easier for input variables to enter/leave the stepwise model) produces the same results
