Credit Card Marketing
Classification Trees
From Building Better Models with JMP® Pro,
Chapter 6, SAS Press (2015). Grayson, Gardner
and Stephens.
Used with permission. For additional information,
see community.jmp.com/docs/DOC-7562.
Credit Card Marketing
Classification Trees
Key ideas: Classification trees, validation, confusion matrix,
misclassification, leaf report, ROC
curves, lift curves.
Background
A bank would like to understand the demographics and other
characteristics associated with whether a
customer accepts a credit card offer. Observational data is
somewhat limited for this kind of problem, in
that often the company sees only those who respond to an offer.
To get around this, the bank designs a
focused marketing study, with 18,000 current bank customers.
This focused approach allows the bank to
know who does and does not respond to the offer, and to use
existing demographic data that is already
available on each customer.
The designed approach also allows the bank to control for other
potentially important factors so that the
offer combination isn’t confused or confounded with the
demographic factors. Because of the size of the
data and the possibility that there are complex relationships
between the response and the studied
factors, a decision tree is used to find out if there is a smaller
subset of factors that may be more
important and that warrant further analysis and study.
The Task
We want to build a model that will provide insight into why
some bank customers accept credit card offers.
Because the response is categorical (either Yes or No) and we
have a large number of potential predictor
variables, we use the Partition platform to build a classification
tree for Offer Accepted. We are primarily
interested in understanding characteristics of customers who
have accepted an offer, so the resulting
model will be exploratory in nature.1
The Data: Credit Card Marketing BBM.jmp
The data set consists of information on the 18,000 current bank
customers in the study.
Customer Number: A sequential number assigned to the
customers (this column is hidden and
excluded – this unique identifier will not be used directly).
Offer Accepted: Did the customer accept (Yes) or reject (No)
the offer.
Reward: The type of reward program offered for the card.
Mailer Type: Letter or postcard.
Income Level: Low, Medium or High.
# Bank Accounts Open: How many non-credit-card accounts are
held by the customer.
1 In exploratory modeling, the goal is to understand the variables or characteristics that drive behaviors or particular outcomes. In predictive modeling, the goal is to accurately predict new observations and future behaviors, given the current information and situation.
Overdraft Protection: Does the customer have overdraft
protection on their checking account(s)
(Yes or No).
Credit Rating: Low, Medium or High.
# Credit Cards Held: The number of credit cards held at the
bank.
# Homes Owned: The number of homes owned by the customer.
Household Size: Number of individuals in the family.
Own Your Home: Does the customer own their home? (Yes or
No).
Average Balance: Average account balance (across all accounts
over time).
Q1, Q2, Q3 and Q4 Balance: Average balance for each quarter
in the last year.
Prepare for Modeling
We start by getting to know our data. We explore the data one
variable at a time, two at a time, and many
variables at a time to gain an understanding of data quality and
of potential relationships. Since the focus
of this case study is classification trees, only some of this work
is shown here. We encourage you to
thoroughly understand your data and take the necessary steps to
prepare your data for modeling before
building exploratory or predictive models.
Exploring Data One Variable at a Time
Since we have a relatively large data set with many potential
predictors, we start by creating numerical
summaries of each of our variables using the Columns Viewer
(see Exhibit 1). (Under the Cols menu
select Columns Viewer, then select all variables and click Show
Summary. To deselect the variables,
click Clear Select).
Exhibit 1 Credit, Summary Statistics for All Variables With
Columns Viewer
Under N Categories, we see that each of our categorical
variables has either two or three levels. N
Missing indicates that we are missing 24 observations for each
of the balance columns. (Further
investigation indicates that these values are missing from the
same 24 customers.) The other statistics
provide an idea of the centering, spread and shapes of the
continuous distributions.
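For readers who want to reproduce these one-variable summaries outside JMP, a minimal sketch in Python with pandas follows. The file name credit_card_marketing.csv and the exported column names are assumptions for illustration, not part of the case.

```python
# Columns Viewer-style summaries, assuming the JMP table has been
# exported to a hypothetical credit_card_marketing.csv.
import pandas as pd

df = pd.read_csv("credit_card_marketing.csv")

print(df.describe())                         # center, spread and shape of continuous columns
print(df.select_dtypes("object").nunique())  # N Categories for each categorical column
print(df.isna().sum())                       # N Missing (24 for each balance column)
print(df["Offer Accepted"].value_counts(normalize=True))  # about 5.68 percent Yes
```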
Next, we graph our variables one at a time. (Select the variables
within the Columns Viewer and click on
the Distribution button. Or, use Analyze > Distribution, select
all of the variables as Y, Columns, and
click OK. Click Stack from the top red triangle for a horizontal
layout).
In Exhibit 2, we see that only around 5.68 percent of the 18,000
offers were accepted.
Exhibit 2 Credit, Distribution of Offer Accepted
We select the Yes level in Offer Accepted and then examine the
distribution of accepted offers (the
shaded area) across the other variables in our data set (the first
10 variables are shown in Exhibit 3).
Our two experimental variables are Reward and Mailer Type.
Offers promoting Points and Air Miles are
more frequently accepted than those promoting Cash Back,
while Postcards are accepted more often
than Letters. Offers also appear to be accepted at a higher rate
by customers with low to medium
income, no overdraft protection and low credit ratings.
Exhibit 3 Credit, Distribution of First 10 Variables
Note that both Credit Rating and Income Level are coded as
Low, Medium and High. Peeking at the
data table, we see that the modeling types for these variables
are both nominal, but since the values are
ordered categories they should be coded as ordinal variables. To
change modeling type, right-click on the
modeling type icon in the data table or in any dialog window,
and then select the correct modeling type.
Exploring Data Two Variables at a Time
We explore relationships between our response and potential
predictor variables using Analyze > Fit Y
by X (select Offer Accepted as Y, Response and the predictors
as X, Factor, and click OK.) For
categorical predictors (nominal or ordinal), Fit Y by X conducts
a contingency analysis (see Exhibit 4).
For continuous predictors, Fit Y by X fits a logistic regression
model.
The first two analyses in Exhibit 4 show potential relationships
between Offer Accepted and Reward
(left) and Offer Accepted and Mailer Type (right). Note that
tests for association between the categorical
predictors and Offer Accepted are also provided by default
(these are not shown in Exhibit 4), and
additional statistical options and tests are provided under the
top red triangles.
Exhibit 4 Credit Fit Y by X, Offer Accepted versus Reward
and Mailer Type
Although we haven’t thoroughly explored this data, thus far
we’ve learned that:
• Only a small percentage – roughly 5.68 percent – of offers are
accepted.
• We are missing some data for the Balance columns, but are not
missing values for any other
variable.
• Both of our experimental variables (Reward and Mailer Type)
appear to be related to
whether or not an offer is accepted.
• Two variables, Income Level and Credit Rating, should be
coded as Ordinal instead of
Nominal.
Again, we encourage you to thoroughly explore your data and to
investigate and resolve potential data
quality issues before building a model. Other tools, such as
scatterplots and the Graph Builder, should
also be used.
While we have only superficially explored the data in this
example (as we will also do in future examples),
the primary purpose of this exploration is to get to know the
variables used in this case study. As such, it
is intentionally brief.
The Partition Model Dialog
Having a good sense of data quality and potential relationships,
we now fit a partition model to the data
using Analyze > Modeling > Partition, with Offer Accepted as
Y, Response and all of the other
variables as X, Factor (see Exhibit 5).
Exhibit 5 Credit, Partition Dialog Window
A Bit About Model Validation
When we build a statistical model, there is a risk that the model
is overly complicated (overfit), or that the
model will not perform well when applied to new data. Model
validation (or cross-validation) is often used
to protect against over-fitting.2 There are two methods for
model validation available from the Partition
dialog window in JMP Pro: Specify a Validation Portion or
select a Validation column (note that other
methods are available from within the platform).
In this example, we’ll use a random hold out portion (30
percent) to protect against overfitting (to do so,
enter 0.30 in the Validation Portion field). This will assign 70
percent of the records to the training set,
which is used to build the model. The remaining 30 percent will
be assigned to the hold out validation set,
which will be used to see how well the model performs on data
not used to build the model.
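As a point of comparison outside JMP, the same 70/30 random holdout can be sketched with scikit-learn. This assumes the data frame df from the earlier pandas sketch; it mimics the Validation Portion option rather than reproducing JMP's internal mechanism.

```python
# 70/30 random train/validation holdout, analogous to Validation Portion = 0.30.
# random_state is fixed only so the split is repeatable.
from sklearn.model_selection import train_test_split

train, valid = train_test_split(df, test_size=0.30, random_state=123)
print(len(train), len(valid))  # roughly 12,600 training and 5,400 validation rows
```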
Other Partition Model Dialog Options
Two additional options in the dialog window are Informative
Missing and Ordinal Restricts Order.
These are selected by default. In this example, we have two
ordinal predictors, Credit Rating and
2 For background information on model validation and
protecting against overfitting, see
en.wikipedia.org/wiki/Overfitting. For more
information on validation in JMP and JMP Pro, see Building
Better Models with JMP Pro, Chapter 6 and Chapter 8, or search
for
“validation” in the JMP Help.
Income Level. We also have missing values for the five balance
columns. The Informative Missing
option tells JMP to include rows with missing values in the
model, and the Ordinal Restricts Order
option tells JMP to respect the ordered categories for ordinal
variables. For more information on these
options, see JMP Help.
The completed Partition dialog window is displayed in Exhibit
5.
Building the Classification Tree
Initial results show the overall breakdown of Offer Accepted
(Exhibit 6). Recall that roughly 5.7 percent of
offers were accepted. Note: Since a random holdout is used, your results from this point forward may be different.3
Below the graph, we see that 12,610 observations are assigned
to the training set. These observations
will be used to build the model. The remaining 5,390
observations in the validation set will be used to
check model performance and to stop tree growth.
Note that we have changed some of the default settings:
• Since we have a relatively large data set, points were removed
from the graph (click on the top
red triangle and select Display Options > Show Points).
• The response rates and counts are displayed in the tree nodes
(select Display Options > Show
Split Count from the top red triangle).
Exhibit 6 Credit, Partition Initial Window
3 To obtain the same results as shown here, use the Random Seed Reset add-in to set the random seed to 123 before launching the Partition platform. The add-in can be downloaded and installed from the JMP User Community: community.jmp.com/docs/DOC-6601.
How Classification Trees Are Formed
When building a classification tree, JMP iteratively splits the
data based on the values of predictors to
form subsets. These subsets form the “branches” of the tree.
Each split is made at the predictor value
that causes the greatest difference in proportions (for the
outcome variable) in the resulting subsets.
A measure of the dissimilarity in proportions between the two
subsets is the likelihood ratio chi-square
statistic and its associated p-value. The lower the p-value, the
greater the difference between the groups.
When JMP calculates this chi-square statistic in the Partition
platform, it is labeled G^2, and the p-value
that is calculated is adjusted to account for the number of splits
that are being considered. The adjusted
p-value is transformed to a log scale using the formula LogWorth = -log10(adjusted p-value). The bigger the LogWorth value, the better the split
(Sall, 2002).
To find the split with the largest difference between subgroups
(and the corresponding largest value of
LogWorth), we need to consider all possible splits. For each
variable, the best split location, or cut point,
is determined, and the split with the highest LogWorth is chosen
as the optimal split location.
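To make the G^2 and LogWorth calculation concrete, here is an illustrative Python sketch for a single candidate split, continuing with the df from the earlier sketch. It computes the unadjusted LogWorth; JMP's reported value uses a p-value adjusted for the number of splits considered, which this sketch omits.

```python
# G^2 (likelihood-ratio chi-square) and unadjusted LogWorth for one
# candidate split, e.g. Credit Rating = Low vs. Medium/High.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def logworth(split_mask, response):
    # 2x2 table: side of the split vs. Offer Accepted (No/Yes)
    table = pd.crosstab(split_mask, response)
    g2, p, _, _ = chi2_contingency(table, lambda_="log-likelihood")
    return g2, -np.log10(p)

g2, lw = logworth(df["Credit Rating"] == "Low", df["Offer Accepted"])
print(f"G^2 = {g2:.1f}, LogWorth (unadjusted) = {lw:.1f}")
```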
JMP reports the G^2 and LogWorth values, along with the best
cut points for each variable, under
Candidates (use the gray disclosure icon next to Candidates to
display). A peek at the candidates in our
example indicates that the first split will be on Credit Rating,
with Low in one branch and High and
Medium in the other (Exhibit 7).
Exhibit 7 Credit, Partition Initial Candidate Splits
The tree after three splits (click Split three times) is shown in
Exhibit 8.
Not surprisingly, the model is split on Credit Rating, Reward
and Mailer Type. The lowest probability of accepting the offer (0.0196) falls in the branch Credit Rating(Medium, High) and Reward(Cash Back, Points). The highest probability (0.1473) falls in the branch Credit Rating(Low) and Mailer Type(Postcard).
After each split, the model RSquare (or Entropy RSquare)
updates (this is shown at the top of Exhibit 8).
RSquare is a measure of how much variability in the response is
being explained by the model. Without a
validation set, we can continue to split until the minimum split
size is achieved in each branch. (The
minimum split size is an option under the top red triangle,
which is set to 5 by default.) However,
additional splits are not necessarily beneficial, and can lead to more complex and potentially overfit models.
Exhibit 8 Credit, Partition after Three Splits
Since we have a validation set, we click Go to automate the
tree-building process. When this option is
used, the final model will be based on the model with the
maximum value of the Validation RSquare
statistic.
The Split History Report (Exhibit 9) shows how the RSquare
value changes for training and validation
data after each split (note that the y-axis has been rescaled for
illustration). The vertical line is drawn at
15, the number of splits used in the final model.
Exhibit 9 Split History, with Maximum Validation R-Square at
Fifteen Splits
This illustrates both the concept of overfitting and the
importance of using validation. With each split, the
RSquare for the training data continues to increase. However,
after 15 splits, the validation RSquare (the
lower line in Exhibit 9) starts to decrease. For the validation
set, which was not used to build the model,
additional splits are not improving our ability to predict the
response.
Understanding the Model
To summarize which variables are involved in these 15 splits,
we turn on Column Contributions (from
the top red triangle). This table indicates which variables are
most important in terms of the overall
contribution to the model (see Exhibit 10).
Credit Rating, Mailer Type, Reward and Income Level
contribute most to the model. Several variables,
including the five balance variables, are not involved in any of
the splits.
Exhibit 10 Credit, Column Contributions after Fifteen Splits
Model Classifications and the Confusion Matrix
One overall measure of model accuracy is the Misclassification
Rate (select Show Fit Details from the
top red triangle). The misclassification rate for our validation
data is 0.0573, or 5.73 percent. The numbers
behind the misclassification rate can be seen in the confusion
matrix (bottom, in Exhibit 11). Here, we
focus on the misclassification rate and confusion matrix for the
validation data. Since these data were not
used in building the model, this approach provides a better
indication of how well the model classifies our
response, Offer Accepted.
There are four possible outcomes in our classification:
• An accepted offer is correctly classified as an accepted offer.
• An accepted offer is misclassified as not accepted.
• An offer that was not accepted is correctly classified as not
accepted.
• An offer that was not accepted is misclassified as accepted.
One observation is that there were few cases wherein the model
predicted that the offer would be
accepted (see value “2” in the Yes column of the validation
confusion matrix in Exhibit 11.) When the
target variable is unbalanced (i.e., there are far more
observations in one level than in the other), the
model that is fit will usually result in probabilities that are
small for the underrepresented category.
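Outside JMP, these numbers come directly from the predicted classes and the holdout labels. In the sketch below, tree, X_valid and y_valid are assumed names for a fitted scikit-learn classifier and the validation predictors and response; they are illustrative stand-ins, not objects defined in the case.

```python
# Validation confusion matrix and misclassification rate.
from sklearn.metrics import confusion_matrix, accuracy_score

pred = tree.predict(X_valid)  # tree: a fitted DecisionTreeClassifier
print(confusion_matrix(y_valid, pred, labels=["No", "Yes"]))
print("Misclassification rate:", 1 - accuracy_score(y_valid, pred))
```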
Exhibit 11 Credit, Fit Details With Confusion Matrix
In this case, the overall rate of Yes (i.e., offer accepted) is 5.68
percent, which is close to the
misclassification rate for this model. However, when we
examine the Leaf Report for the fitted model
(Exhibit 12), we see that there are branches in the tree that have
much richer concentrations of Offer
Accepted = Yes than the overall average rate. (Note that results
in the Leaf Report are for the training
data.)
Exhibit 12 Credit, Leaf Report for Fitted Model
The model has probabilities of Offer Accepted = Yes in the
range [0.0044, 0.6738]. When JMP classifies
rows with the model, it uses a default of Prob > 0.5 to make the
decision. In this case, only one of the
predicted probabilities of Yes is > 0.5, and this one branch (or
node) has only 11 observations: 8 yes and
3 no under Response Counts in the bottom table in Exhibit 12.
The next highest predicted probability of
Offer Accepted = Yes is 0.2056. As a result, all other rows are
classified as Offer Accepted = No.
The ROC Curve
Two additional measures of accuracy used when building
classification models are Sensitivity and 1-Specificity. Sensitivity is the true positive rate. In our example,
this is the ability of our model to correctly
classify Offer Accepted as Yes. The second measure, 1-Specificity, is the false positive rate. In this
case, a false positive occurs when an offer was not accepted, but
was classified as Yes (accepted).
Instead of using the default decision rule of Prob > 0.5, we
examine the decision rule Prob > T, where we
let the decision threshold T range from 0 to 1. We plot
Sensitivity (on the y-axis) versus 1-Specificity (on
the x-axis) for each possible threshold value. This creates a
Receiver Operating Characteristic (ROC)
curve. The ROC curve for our model is displayed in Exhibit 13
(this is a top red triangle option in the
Partition report).
Exhibit 13 Credit, ROC Curve for Offer Accepted
Conceptually, the ROC curve measures the ability of the predicted probability formula to rank observations. Here, we simply focus on the Yes outcome for the
Offer Accepted response variable. We
save the probability formula to the data table, and then sort the
table from highest to lowest probability. If
this probability model can correctly classify the outcomes for
Offer Accepted, we would expect to see
more Yes response values at the top (where the probability for
Yes is highest) than No responses.
Similarly, at the bottom of the sorted table, we would expect to
see more No than Yes response values.
Constructing an ROC Curve
What follows is a practical algorithm to quickly draw an ROC
curve after the table has been sorted by the
predicted probability. Here, we walk through the algorithm for
Offer Accepted = Yes, but this is done
automatically in JMP for each response category.
For each observation in the sorted table, starting at the
observation with the highest probability, Offer
Accepted = Yes:
• If the observed response value is Yes, then a vertical line
segment (increasing along the Sensitivity
axis) is drawn. The length of the line segment is 1/(total number
of Yes responses in the table).
• If the observed response value is No, then a horizontal line segment (increasing along the 1-Specificity axis) is drawn. The length of the line segment is 1/(total number of No responses in the data table).
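The two steps above translate directly into code. The sketch below assumes NumPy arrays prob_yes (predicted probability of Yes) and actual (the observed labels); both names are illustrative.

```python
# Manual ROC construction: sort by P(Yes) descending, then step up
# for each actual Yes and right for each actual No.
import numpy as np

order = np.argsort(-prob_yes)               # highest probability first
is_yes = (actual[order] == "Yes")

tpr = np.cumsum(is_yes) / is_yes.sum()      # Sensitivity (vertical steps)
fpr = np.cumsum(~is_yes) / (~is_yes).sum()  # 1 - Specificity (horizontal steps)

auc = np.trapz(tpr, fpr)                    # area under the staircase
```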
Simple ROC Curve Examples
We use a simple example to illustrate. Suppose we have a data
table with only 8 observations. We sort
these observations from high to low based on the probability
that the Outcome = Yes. The sorted actual
response values are Yes, Yes, Yes, No, Yes, Yes, No and Yes.
This results in the ROC curve on the left
of Exhibit 14. Arrows have been added to show the steps in the
ROC curve construction. The first three
line segments are drawn up because the first three sorted values
have Outcome = Yes.
Now, suppose that we have a different probability model that
we use to rank the observations, resulting in
the sorted outcomes Yes, No, Yes, No, No, Yes, No and Yes.
The ROC curve for this situation is shown
on the right of Exhibit 14. The first ROC curve moves “up”
faster than the second curve. This is an
indication that the first model is doing a better job of separating
the Yes responses from the No
responses based on the predicted probability.
Exhibit 14 ROC Curve Examples
Referring back to the sample ROC curve in Exhibit 13, we see
that JMP has also displayed a diagonal
reference line on the chart, which represents the Sensitivity = 1-Specificity line. If a probability model
cannot sort the data into the correct response category, then it
may be no better than simply sorting at
random. In this case, the ROC curve for a “random ranking”
model would be similar to this diagonal line.
A model that sorts the data perfectly, with all the Yes responses
at the top of the sorted table, would have
an ROC Curve that goes from the origin of the graph straight up to sensitivity = 1, then straight over to 1-specificity = 1. A model that sorts perfectly can be made into a
classifier rule that classifies perfectly; that
is, a classifier rule that has a sensitivity of 1.0 and 1-specificity
of 0.0.
The area under the curve, or AUC (labeled Area in Exhibit 13)
is a measure of how well our model sorts
the data. The diagonal line, which would represent a random
sorting model, has an AUC of 0.5. A perfect
sorting model has an AUC of 1.0. The area under the curve for
Offer Accepted = Yes is 0.7369 (see
Exhibit 13), indicating that the model predicts better than the
random sorting model.
The Lift Curve
Another measure of how well a model can sort outcomes is the
model lift. As with the ROC curve, we
examine the table that is sorted in descending order of predicted
probability. For each sorted row, we
calculate the sensitivity and divide it by the proportion of rows in the table for which Offer Accepted = Yes. This value is the model lift.
Lift is a measure of how much “richness” in the response we
achieve by applying a classification rule to
the data. A Lift Curve plots the Lift (on the y-axis) against the
Portion (on the x-axis). Again, consider
the data table that has been sorted by the predicted probability
of a given outcome. As we go down the
table from the top to the bottom, portion is the relative position
of the row that we are considering. The top
10 percent of rows in the sorted table corresponds to a portion
of 0.1, the top 20 percent of rows
corresponds to a portion of 0.2, and so on. The lift for Offer
Accepted = Yes for a given portion is simply
the proportion of Yes responses in this portion, divided by
the overall proportion of Yes responses in the
entire data table.
The higher the lift at a given portion, the better our model is at
correctly classifying the outcome within this
portion. For Offer Accepted = Yes, the lift at Portion = 0.15 is
roughly 2.5 (see Exhibit 15). This means that among the rows in the data table corresponding to the top 15 percent of the model’s predicted probabilities, the number of actual Yes outcomes is 2.5 times what we would expect if we had chosen 15 percent of
rows from the data set at random. If the model does not sort the
data well, then the lift will hover at around 1.0 across all portion values.
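Continuing the illustrative sketches (same assumed arrays prob_yes and actual), the lift at a given portion follows directly from this definition:

```python
# Lift at a portion: Yes rate among the top rows, sorted by P(Yes),
# divided by the overall Yes rate.
import numpy as np

def lift_at(portion, prob_yes, actual):
    n_top = int(round(portion * len(actual)))
    top = np.argsort(-prob_yes)[:n_top]
    return (actual[top] == "Yes").mean() / (actual == "Yes").mean()

print(lift_at(0.15, prob_yes, actual))  # roughly 2.5 for this model
```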
Exhibit 15 Lift Curve for Offer Accepted
Lift provides another measure of how good our model is at
classifying outcomes; it is particularly useful
when the overall predicted probabilities are lower than 0.5 for
the outcome that we wish to predict.
Though in this example the majority of the predicted
probabilities of Offer Accepted = Yes were less than
0.2, the lift curve indicates that there are threshold values that
we could use with the predicted probability
model to create a classifier rule that will be better than guessing
at random. This rule can be used to
identify portions of our data that contain a much higher concentration of customers who are likely to accept an
offer.
For categorical response models, the misclassification rate,
confusion matrix, ROC curve and lift curve all
provide measures of model accuracy; each of these should be
used to assess the quality of the prediction
model.
Summary
Statistical Insights
In this case study, a classification tree was used to predict the
probability of an outcome based on a
set of predictor variables using the Partition platform. If the
response variable is continuous rather
than categorical, then a regression tree can be used to predict
the mean of the response.
Construction of regression trees is analogous to construction of classification trees; however, splits are based on the mean response value rather than the probability of outcome categories.
Implications
This model was created for explanatory rather than predictive
purposes. Our goal was to understand
the characteristics of customers most likely to accept a credit
card offer. In a predictive model, we are
more interested in creating a model that accurately predicts the
response (i.e., predicts future
customer behavior) than we are in identifying important
variables or characteristics.
JMP Features and Hints
In this case study we used the Columns Viewer and Distribution platforms to explore variables one at a time, and Fit Y by X to explore the relationship between our response (or target variable) and predictor variables. This exploratory work was only partially shown here.
We used the misclassification rate and confusion matrix as overall
measures of model accuracy, and introduced the ROC and lift
curves as additional measures of
accuracy. As discussed, ROC and lift curves are particularly
useful in cases where the probability of
the target response category is low.
Note: The cutoff for classification used throughout JMP is 0.50.
In some modeling situations it may be
desirable to change the cutoff of classification (say, when the
probability of response is extremely
low). This effect can be achieved manually by saving the
prediction formula to the data table, and
then creating a new formula column that classifies the outcome
based on a specified cutoff. In the
sample formula below, JMP will classify an outcome as “Yes” if the predicted probability of Yes is at least 0.30. The Tabulate platform (under Analyze) can then be
used to manually create a confusion
matrix.
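The manual-cutoff idea sketches easily in Python as well (again with the assumed prob_yes and actual arrays): classify Yes at a lower threshold, then cross-tabulate to form the confusion matrix.

```python
# Classify Yes at a chosen cutoff, then tabulate actual vs. predicted.
import numpy as np
import pandas as pd

cutoff = 0.30
pred_class = np.where(prob_yes >= cutoff, "Yes", "No")
print(pd.crosstab(actual, pred_class))  # confusion matrix at this cutoff
```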
An add-in from the JMP User Community can also be used to
change the cut-off for classification
(community.jmp.com/docs/DOC-6901). This add-in allows the
user to enter a range of values for the
cutoff, and produces confusion matrices for each cutoff value.
The goal is to find a cutoff that
minimizes the misclassification rate on the validation set.
Exercises
Exercise 1: Use the Credit Card Marketing BBM.jmp data set to
answer the following
questions:
a. The Column Contributions output after 15 splits is shown in
Exhibit 10. Interpret this
output. How can this information be used by the company?
What is the potential value of
identifying these characteristics?
b. Recreate the output shown in Exhibits 9-11, but instead use
the split button to manually
split. Create a classification tree with 25 splits.
c. How did the Column Contributions report change?
d. How did the Misclassification Rate and Confusion Matrix
change? How did the ROC or
Lift Curves change? Did these additional splits provide any
additional (useful)
information?
e. Why is this an exploratory model rather than a predictive
model? Describe the difference
between exploratory and predictive models.
Exercise 2: Use the Titanic Passengers.jmp data set in the JMP
Sample Data Library (under
the Help menu) for this exercise.
This data table describes the survival status of 1,309 of the
1,324 individual passengers on the
Titanic. Information on the 899 crew members is not included.
Some of the variables are
described below:
Name: Passenger Name
Survived: Yes or No
Passenger Class: 1, 2, or 3 corresponding to 1st, 2nd or 3rd
class
Sex: Passenger sex
Age: Passenger age
Siblings and Spouses: The number of siblings and spouses
aboard
Parents and Children: The number of parents and children
aboard
Fare: The passenger fare
Port: Port of embarkation (C = Cherbourg; Q = Queenstown; S =
Southampton)
Home/Destination: The home or final intended destination of
the passenger
Build a classification tree for Survived by determining which
variables to include as predictors.
Do not use model validation for this exercise. Use Column
Contributions and Split History to
determine the optimal number of splits.
a. Which variables, if any, did you choose not to include in the
model? Why?
b. How many splits are in your final tree?
c. Which variables are the largest contributors?
d. What is your final model? Save the prediction formula for
this model to the data table (we
will refer to it in the next exercise).
e. What is the misclassification rate for this model? Is the
model better at predicting survival
or non-survival? Explain.
f. What is the area under the ROC curve for Survived? Interpret
this value. Does the model
do a better job of classifying survival than a random model?
g. What is the lift for the model at portion = 0.1 and at portion
= 0.25? Interpret these values.
Exercise 3: Use the Titanic Passengers.jmp data set for this
exercise. Use the Fit Model
platform to create a logistic regression model for Survived
using the other variables as
predictors. Include interaction terms you think might be
meaningful or significant in predicting the
probability of survival.
For information on fitting logistic regression models, see the
guide and video at
community.jmp.com/docs/DOC-6794.
a. Which variables are significant in predicting the probability
of survival? Are any of the
interaction terms significant?
b. What is the misclassification rate for your final logistic
model?
c. Compare the misclassification rates for the logistic model and
the partition model created
in Exercise 2. Which model is better? Why?
d. Compare this model to the model produced using a
classification tree. Which model
would be easier to explain to a non-technical person? Why?
Case 2 - Baggage Complaints:
Descriptive Statistics and Time Series
Plots
Marlene Smith, University of Colorado Denver
Business School
2
Baggage Complaints:
Descriptive Statistics and Time Series Plots
Background
Anyone who travels by air knows that occasional problems are
inevitable. Flights can be delayed or
cancelled due to weather conditions, mechanical problems, or
labor strikes, and baggage can be lost,
delayed, damaged, or pilfered. Given that many airlines are
now charging for bags, issues with baggage
are particularly annoying. Baggage problems can have a serious
impact on customer loyalty, and can be
costly to the airlines (airlines often have to locate and deliver mishandled bags).
Air carriers report flight delays, cancellations, overbookings,
late arrivals, baggage complaints, and other
operating statistics to the U.S. government, which compiles the
data and reports it to the public.
The Task
Do some airlines do a better job of handling baggage? Compare
the baggage complaints for three
airlines: American Eagle, Hawaiian, and United. Which airline
has the best record? The worst? Are
complaints getting better or worse over time? Are there other
factors, such as destinations, seasonal
effects or the volume of travelers that affect baggage
performance?
The Data: Baggage Complaints.jmp
The data set contains monthly observations from 2004 to 2010
for United Airlines, American Eagle, and
Hawaiian Airlines. The variables in the data set include:
Baggage: The total number of passenger complaints for theft of baggage contents, or for lost, damaged, or misrouted luggage for the airline that month
Scheduled: The total number of flights scheduled by that airline that month
Cancelled: The total number of flights cancelled by that airline that month
Enplaned: The total number of passengers who boarded a plane with the airline that month
These data are available from the U.S. Department of
Transportation, “Air Travel Consumer Report,” the
Office of Aviation Enforcement and Proceedings, Aviation
Consumer Protection Division
(http://airconsumer.dot.gov/reports/index.htm). The data for
baggage complaints and enplaned
passengers cover domestic travel only.
Analysis
We start by exploring baggage complaints over time.
Exhibit 1 shows the time series plot for the variable Baggage by
Date for each of the airlines. United
Airlines has the most complaints about mishandled baggage in
almost all of the months in the data set;
Hawaiian Airlines has the fewest number of complaints in all
months. Do we conclude, then, that United
Airlines has the “worst record” for mishandled baggage and
Hawaiian, the best?
Exhibit 1 Time Series Plots of Baggage by Airline
(Graph > Graph Builder; drag and drop Baggage in Y, Date in X
and Airline in Overlay. Click on the smoother icon
at the top to remove the smoother, and click on the line icon.
Or, right-click in the graph, and select Smoother >
Remove, and Points > Change to > Line. Then, click Done.)
United Airlines is a much bigger airline, as evidenced by
Exhibit 2, which shows the average number of
scheduled flights and enplaned passengers by airline. United
handles more than three times the number
of passengers than American Eagle on average, and almost eight
times more than Hawaiian. Thus,
United has more opportunities to mishandle luggage because it
handles more luggage – it’s simply a
much bigger airline.
Exhibit 2 Average Scheduled Flights and Enplaned Passengers
by Airline
(Analyze > Tabulate; drag Airline in drop zone for
rows, and Scheduled and Enplaned in the drop
zone for column as analysis columns. Then, drag
Mean from the middle panel to the middle of the
table.
Note that in JMP versions 10 and earlier Tabulate is
under the Tables menu.)
To adjust for size we calculate the rate of baggage complaints
(Exhibit 3): Baggage % = 100 ×
(Baggage/Enplaned).
Exhibit 3 Calculating Baggage %
(Create a new column in the data table, and
rename it Baggage %. Right click on the
column header, and select Formula to open
the Formula Editor. To create the formula:
1. Type 100
2. Select multiply on the keypad
3. Select Baggage from the columns list
4. Select divide by on the keypad
5. Select Enplaned from the columns list
6. Click OK.)
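The same computed column takes one line in pandas; baggage.csv is a hypothetical export of Baggage Complaints.jmp.

```python
# Rate of baggage complaints per enplaned passenger, as a percentage.
import pandas as pd

bags = pd.read_csv("baggage.csv")
bags["Baggage %"] = 100 * bags["Baggage"] / bags["Enplaned"]
print(bags.groupby("Airline")["Baggage %"].mean())  # compare with Exhibit 4
```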
In Exhibit 4 we compare the records of the three airlines using
Baggage %. We see that American
Eagle has the highest rate of baggage complaints when adjusted
for number of enplaned passengers.
Exhibit 4 Average Baggage % by Airline
Plotting the Baggage % on a time series plot allows us to see
changes over time. In Exhibit 5 we see
that baggage complaint rates from American Eagle and United
passengers increased through 2006 and
began declining thereafter.
Exhibit 5 Time Series Plots of Baggage % by Airline
The time series for Hawaiian passengers in Exhibit 5 is
relatively flat compared to American Eagle and
United, so it’s difficult to detect a pattern over time. In Exhibit
6 we isolate the data for the Hawaiian
flights. Complaint rates for Hawaiian passengers dropped beginning in the summer of 2008 and stayed lower until fall of 2010, after which they returned to historical levels.
Exhibit 6 Data Filter and Time Series Plot of Baggage % -
Hawaiian Airline
(Rows > Data Filter; select Airline
and click Add. Then, select
Hawaiian and check the Show and
Include boxes.)
Let’s return to Exhibit 5: Do we see any other patterns in
baggage complaint rates over time? The pattern
of spikes and dips indicates that changes in the rate of baggage
complaints may have a seasonal
component. In Exhibit 7 we plot the average Baggage % by
month for the three airlines to investigate
this further.
Average rates are highest in December and January and in the
summer months, and lowest in the spring
and late fall. Interestingly, all three airlines seem to follow the
same general pattern. (Using the Data
Filter to zoom in on Hawaiian will make this more apparent.)
Exhibit 7 Time Series Plots of Monthly Average Baggage % by
Airline
(Graph > Graph Builder; drag and drop Baggage % in Y, Month
in X and Airline in Group X. Then, click on the line icon at the
top.
Or, right-click in the graph, and select Points > Change to >
Line. Then, click Done.)
Summary
Statistical Insights
It is important to ask a lot of pointed questions about the
purpose and scope of a project before
jumping into data analysis. How will the results be used? And,
is the information available to answer
the questions being asked? The data set provided did not have
information on flight destinations, so
we are not able to investigate whether destination is related to
baggage issues.
To provide a fair comparison of performance across airlines,
it’s best to standardize for differences in
volume. It is also important to pay close attention to units of
measurement, as will be seen in an
exercise.
Managerial Implications
When asking a data analyst to conduct a study, ensure that the
purpose is clear. What, specifically,
do you want to know? And, how will the information provided
by the analyst be used to guide the
decision-making process?
What did we learn from the analysis? Managers at American
Eagle should note the relatively high
rate of baggage complaints and consider costs and benefits of
improvements. All three airlines
should anticipate high rates of baggage complaints in December, January and the summer months, and
should plan accordingly. If we are interested in studying
baggage complaints for different
destinations, additional data are required.
JMP Features and Hints
This case uses the Graph Builder to compare multiple time
series and Tabulate to produce summary
statistics by groups. Data Filter is used with the Graph Builder
to further investigate a particular
group.
Exercises
1. Exhibit 4 shows that the average Baggage % for American
Eagle is 1.033. Provide a
nontechnical interpretation of this number by using it in this
sentence: “1.033 is the…” Pay
particular attention to units of measurement.
2. Use Tabulate to calculate the standard deviations of Baggage
% for each airline. Provide a non-
technical interpretation of the standard deviations. Is one
airline more variable than the others?
3. Compare the three airlines based on cancelled flights: Which
airline has the best record? The
worst?
Case 8 - Contributions:
Simple Linear Regression and Time
Series
Marlene Smith, University of Colorado Denver
Business School
Contributions1:
Simple Linear Regression and Time Series
Background
The Colorado Combined Campaign solicits Colorado
government employees’ participation in a fund-
raising drive. Funds raised by the campaign go to over 700
Colorado charities in all, including the
Humane Society of Boulder Valley and the Denver Children’s
Advocacy Center. Prominent state
employees, such as university presidents, chancellors and
lieutenant governors, head the annual
campaigns. An advisory committee determines whether the
charities receiving contributions provide the
services claimed in a fiscally responsible manner.
All Colorado state employees may contribute to the fund.
However, certain state institutions are targeted
to receive promotional brochures and campaign literature.
Employees in these targeted groups are
referred to as “eligible” employees. Each year, the number of
eligible employees is known in June.
Fund-raising activities are then conducted throughout the fall.
By year’s end, total contributions raised
that year are tabulated.
The Task
It is now June 2010. The number of eligible employees for
2010 has been determined to be 53,455.
Does knowing the number of eligible employees help predict
2010 year-end contributions?
The Data: Contributions.jmp
This is an annual time-series from 1988 – 2009. The variables
are contribution Year and:
Actual: Total contributions to the campaign for the year in dollars
Employees: Number of eligible employees that year
Analysis
The average level of contributions during this time period was
$1,143,769, with a typical fluctuation of
$339,788 around the average. The average number of eligible
employees was 45,419, with a typical
fluctuation of 9,791.
Exhibit 1 Summary Statistics for Actual and Employees
(Analyze > Tabulate; drag Actual and Employees in drop
zone for rows as analysis columns. Then, drag Mean and
Std Dev from the middle panel to drop zone for columns.
Note that in JMP versions 10 and earlier Tabulate is under
the Tables menu.)
1Mel Rael, Executive Director of the Colorado Combined
Campaign, graciously provided these data.
As we can see in Exhibit 2, contributions are growing over
time:
Exhibit 2 Time Series Plot of Actual by Year
(Graph > Graph Builder; drag and drop Actual in Y and Year in
X. Click on the smoother icon at
the top to remove the smoother. Hold the shift key and click
the line icon to add a line. Or, right
click in the graph to select these options. Then, click Done.)
The long-term growth in contributions is attributable to two
phenomena:
• The amount contributed per eligible employee is trending mostly upward (Exhibit 3, top).
• The number of eligible employees is on the rise, particularly
in the 1999 to 2002 campaign years
(Exhibit 3, bottom).
Exhibit 3 Time Series Plots of Actual per Employee and
Employees
(Create a new column and rename it Actual
per Employee, then use the Formula Editor
to create the formula – Actual divided by
Employees.
Follow the instructions for Exhibit 2 to create
the graph for Employees. Then, click and
drag Actual per Employee above
Employees in Y, and release.
To change the markers for points, use the
lasso from the toolbar to select the points
(draw a circle around them). Then go to
Rows > Markers and select a marker.)
The scatterplot and least squares regression line using Actual as
the response variable and Employees
as the predictor variable are shown in Exhibit 4. The formula for
the regression line is found below the plot
under Linear Fit. The slope of the fitted line, 33.555, estimates
the contribution for each eligible employee
over this time period. Hence, the model estimates an additional
$33.56 in contributions for each eligible
employee. Under Parameter Estimates, we see that the number
of employees is a statistically significant
predictor of year-end contributions; the p-value, listed as Prob >
|t|, is < 0.0001.
The number of employees doesn’t perfectly predict
contributions. Just over 93% of the variability in
contributions is associated with variability in number of eligible
employees (RSquare = 0.934907).
Comparing the standard deviation of Actual ($339,788) to the
root mean square error of the regression (RMSE = $88,832) suggests that a substantial
reduction in the variation in contributions occurs
by using the regression model to explain variation in year-end
contributions.
Exhibit 4 Regression with Actual (Y) and Employees (X)
(Analyze > Fit Y by X. Use Actual as Y,
Response and Employees as X, Factor.
Under the red triangle select Fit Line. Note:
To remove the markers in Exhibit 3, go to the
Rows menu and select Clear Row States.)
We’ve been informed that the number of eligible employees in
2010 is 53,455. To use the regression
equation to forecast 2010 year-end contributions, we can plug
this number into the regression equation.
If the number of Employees is 53,455, the predicted Actual
contributions is:
Actual = -380265.5 + (33.555042) × Employees
       = -380265.5 + (33.555042) × (53,455)
       = 1413419.3 (or $1,413,419)
In words, given that the number of eligible employees is 53,455,
our model estimates that 2010 year-end
contributions will be approximately $1.413 million.
Easier still, we can skip the math exercise, save the regression
formula and prediction intervals and ask
JMP to calculate the estimated contributions for 2010 (Exhibit
5). Prediction intervals are useful, since
the number of employees isn’t a perfect predictor of
contributions. The prediction interval gives us an
estimate of the interval in which the 2010 year-end
contributions will fall (with 95% confidence).
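For readers replicating this outside JMP, here is a sketch of the same fit, forecast and 95% prediction interval using statsmodels; contributions.csv is a hypothetical export of Contributions.jmp.

```python
# Simple linear regression of Actual on Employees, with a prediction
# interval for the 2010 forecast (Employees = 53,455).
import pandas as pd
import statsmodels.api as sm

dat = pd.read_csv("contributions.csv")
X = sm.add_constant(dat["Employees"])
fit = sm.OLS(dat["Actual"], X).fit()

new = pd.DataFrame({"const": [1.0], "Employees": [53455]})
pred = fit.get_prediction(new).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])  # forecast near $1.413M
```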
Exhibit 5 Predicted Value and Prediction Interval for 2010
Contribution
(In the Bivariate Fit window, select Save Predicteds under the
red Triangle for Linear Fit. JMP will create a new column with
the
prediction formula for Actual. Create a new row and enter a
value for Employees – the predicted value for Actual will
display. To
save prediction intervals, use Analyze > Fit Model; select
Actual as Y and Employees as a model effect, and hit Run.
Under the
red triangle select Save Columns > Indiv Confidence Intervals.)
Predicted values can also be explored dynamically using the
cross-hair tool. In Exhibit 6, we see that the
predicted value for Actual, if Employees is 53,414, is around
$1.402 million.
Exhibit 6 Using Cross-hair Tool to Explore Predicted
Contribution
(Select the cross-hair tool on the toolbar. Click
on the regression line at the value of the
predictor to see the predicted response value.)
We can also graphically explore prediction intervals (Exhibit 7).
Exhibit 7 Prediction Intervals for Actual
(In the Bivariate Fit window, select
Confid Curves Indiv under the red
Triangle next to Linear Fit. Use
the cross-hairs to find the upper
and lower bounds for the
prediction interval.)
Summary
Statistical Insights
Forecasting using regression involves substituting known or
hypothetical values for X into the
regression equation and solving for Y. In this case, values for
the predictor variable in the forecasting
horizon are known in advance; i.e., we know that the 2010 value
for Employees is 53,455, so we
plugged this value into the regression equation to forecast year-
end contributions. In another setting,
in which the same-year value for X is unknown, how would we
proceed? One possibility is to forecast
the value of the predictor variable. Another possibility, when
theoretically and statistically justified, is
to use lagged values of the original predictor variables in the
regression model.
When building any regression model, residuals should be
checked to ensure that the linear fit makes
sense.
Managerial Implications
Regression has provided a prediction for year-end 2010
Colorado Combined Campaign contributions
of $1.4M. In managerial settings such as this, where the
response variable represents a business
goal, managers often set higher expectations than the predicted
value to motivate improved
performance. One such choice here might be the upper 95%
prediction limit of $1.6M.
This forecasting methodology can be repeated year after year.
Once the final contributions to 2010
are known, they can be added to the data set and the regression
line can be recalculated. By
midyear of 2011, the number of eligible employees will be
known.
Note that, in this case, we focused on a simple regression using only Employees as the predictor; a trend analysis using Year appears in the exercises. We could also fit a model with both Employees and Year. We will consider regression models with more than one
predictor in a future case.
JMP Features and Hints
In this case we used Fit Y by X to develop a regression model.
We used the cross-hair tool to explore
the predicted value of the response at a given value of the
predictor. Several options, such as saving
predicted values and showing prediction intervals, are available
under the red triangle for the fitted
line. When the prediction formula is saved, a new column with
the regression formula is created.
Enter the value of X in a new row in the JMP data table, and the
predicted value will display. To save
prediction intervals to the data table for the value of X, use Fit
Model.
Note that other intervals and model diagnostics are also
available from both Fit Y by X and Fit Model.
To generate residual plots from within Fit Y by X, select the
option under the red triangle next to
Linear Fit.
Exercises
A regression trend analysis uses only the information contained
in the passage of time to predict a
response variable.
1. Perform a trend analysis with the Colorado Combined
Campaign data, using Actual as the
response variable and Year as the predictor.
2. Forecast the 2010 - 2013 Colorado Combined Campaign
contributions.
3. Compare your forecast for 2010 with that obtained from the
simple linear regression model in
which number of eligible employees is the predictor variable.
Hint: Compare RMSE, RSquare,
and the estimated contributions for 2010. Which model does a
better job of explaining variation in
contributions?
4. We’ve limited our analyses to one predictor variable at a
time. Guesstimate what would happen, in
terms of RMSE, RSquare and model predictions if we were to
build a model with both Year and
Employees.
 
MBA 6941, Managing Project Teams 1 Course Learning Ou.docx
 MBA 6941, Managing Project Teams 1 Course Learning Ou.docx MBA 6941, Managing Project Teams 1 Course Learning Ou.docx
MBA 6941, Managing Project Teams 1 Course Learning Ou.docxShiraPrater50
 
Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx
 Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx
Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docxShiraPrater50
 
It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx
 It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx
It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docxShiraPrater50
 
MBA 5101, Strategic Management and Business Policy 1 .docx
 MBA 5101, Strategic Management and Business Policy 1 .docx MBA 5101, Strategic Management and Business Policy 1 .docx
MBA 5101, Strategic Management and Business Policy 1 .docxShiraPrater50
 
MAJOR WORLD RELIGIONSJudaismJudaism (began .docx
 MAJOR WORLD RELIGIONSJudaismJudaism (began .docx MAJOR WORLD RELIGIONSJudaismJudaism (began .docx
MAJOR WORLD RELIGIONSJudaismJudaism (began .docxShiraPrater50
 

More from ShiraPrater50 (20)

Read Chapter 3. Answer the following questions1.Wha.docx
Read Chapter 3. Answer the following questions1.Wha.docxRead Chapter 3. Answer the following questions1.Wha.docx
Read Chapter 3. Answer the following questions1.Wha.docx
 
Read Chapter 15 and answer the following questions 1.  De.docx
Read Chapter 15 and answer the following questions 1.  De.docxRead Chapter 15 and answer the following questions 1.  De.docx
Read Chapter 15 and answer the following questions 1.  De.docx
 
Read Chapter 2 and answer the following questions1.  List .docx
Read Chapter 2 and answer the following questions1.  List .docxRead Chapter 2 and answer the following questions1.  List .docx
Read Chapter 2 and answer the following questions1.  List .docx
 
Read chapter 7 and write the book report  The paper should be .docx
Read chapter 7 and write the book report  The paper should be .docxRead chapter 7 and write the book report  The paper should be .docx
Read chapter 7 and write the book report  The paper should be .docx
 
Read Chapter 7 and answer the following questions1.  What a.docx
Read Chapter 7 and answer the following questions1.  What a.docxRead Chapter 7 and answer the following questions1.  What a.docx
Read Chapter 7 and answer the following questions1.  What a.docx
 
Read chapter 14, 15 and 18 of the class textbook.Saucier.docx
Read chapter 14, 15 and 18 of the class textbook.Saucier.docxRead chapter 14, 15 and 18 of the class textbook.Saucier.docx
Read chapter 14, 15 and 18 of the class textbook.Saucier.docx
 
Read Chapter 10 APA FORMAT1. In the last century, what historica.docx
Read Chapter 10 APA FORMAT1. In the last century, what historica.docxRead Chapter 10 APA FORMAT1. In the last century, what historica.docx
Read Chapter 10 APA FORMAT1. In the last century, what historica.docx
 
Read chapter 7 and write the book report  The paper should b.docx
Read chapter 7 and write the book report  The paper should b.docxRead chapter 7 and write the book report  The paper should b.docx
Read chapter 7 and write the book report  The paper should b.docx
 
Read Chapter 14 and answer the following questions1.  Explain t.docx
Read Chapter 14 and answer the following questions1.  Explain t.docxRead Chapter 14 and answer the following questions1.  Explain t.docx
Read Chapter 14 and answer the following questions1.  Explain t.docx
 
Read Chapter 2 first. Then come to this assignment.The first t.docx
Read Chapter 2 first. Then come to this assignment.The first t.docxRead Chapter 2 first. Then come to this assignment.The first t.docx
Read Chapter 2 first. Then come to this assignment.The first t.docx
 
Journal of Public Affairs Education 515Teaching Grammar a.docx
 Journal of Public Affairs Education 515Teaching Grammar a.docx Journal of Public Affairs Education 515Teaching Grammar a.docx
Journal of Public Affairs Education 515Teaching Grammar a.docx
 
Learner Guide TLIR5014 Manage suppliers TLIR.docx
 Learner Guide TLIR5014 Manage suppliers TLIR.docx Learner Guide TLIR5014 Manage suppliers TLIR.docx
Learner Guide TLIR5014 Manage suppliers TLIR.docx
 
Lab 5 Nessus Vulnerability Scan Report © 2012 by Jone.docx
 Lab 5 Nessus Vulnerability Scan Report © 2012 by Jone.docx Lab 5 Nessus Vulnerability Scan Report © 2012 by Jone.docx
Lab 5 Nessus Vulnerability Scan Report © 2012 by Jone.docx
 
Leveled and Exclusionary Tracking English Learners Acce.docx
 Leveled and Exclusionary Tracking English Learners Acce.docx Leveled and Exclusionary Tracking English Learners Acce.docx
Leveled and Exclusionary Tracking English Learners Acce.docx
 
Lab 5 Nessus Vulnerability Scan Report © 2015 by Jone.docx
 Lab 5 Nessus Vulnerability Scan Report © 2015 by Jone.docx Lab 5 Nessus Vulnerability Scan Report © 2015 by Jone.docx
Lab 5 Nessus Vulnerability Scan Report © 2015 by Jone.docx
 
MBA 6941, Managing Project Teams 1 Course Learning Ou.docx
 MBA 6941, Managing Project Teams 1 Course Learning Ou.docx MBA 6941, Managing Project Teams 1 Course Learning Ou.docx
MBA 6941, Managing Project Teams 1 Course Learning Ou.docx
 
Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx
 Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx
Inventory Decisions in Dells Supply ChainAuthor(s) Ro.docx
 
It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx
 It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx
It’s Your Choice 10 – Clear Values 2nd Chain Link- Trade-offs .docx
 
MBA 5101, Strategic Management and Business Policy 1 .docx
 MBA 5101, Strategic Management and Business Policy 1 .docx MBA 5101, Strategic Management and Business Policy 1 .docx
MBA 5101, Strategic Management and Business Policy 1 .docx
 
MAJOR WORLD RELIGIONSJudaismJudaism (began .docx
 MAJOR WORLD RELIGIONSJudaismJudaism (began .docx MAJOR WORLD RELIGIONSJudaismJudaism (began .docx
MAJOR WORLD RELIGIONSJudaismJudaism (began .docx
 

Recently uploaded

Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3JemimahLaneBuaron
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptxVS Mahajan Coaching Centre
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Disha Kariya
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxSayali Powar
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxShobhayan Kirtania
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesFatimaKhan178732
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 

Recently uploaded (20)

Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions  for the students and aspirants of Chemistry12th.pptxOrganic Name Reactions  for the students and aspirants of Chemistry12th.pptx
Organic Name Reactions for the students and aspirants of Chemistry12th.pptx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptxPOINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
POINT- BIOCHEMISTRY SEM 2 ENZYMES UNIT 5.pptx
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
The byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptxThe byproduct of sericulture in different industries.pptx
The byproduct of sericulture in different industries.pptx
 
Separation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and ActinidesSeparation of Lanthanides/ Lanthanides and Actinides
Separation of Lanthanides/ Lanthanides and Actinides
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 

Credit Card Marketing Classification Trees Explained

  • 1. Credit Card Marketing Classification Trees From Building Better Models with JMP® Pro, Chapter 6, SAS Press (2015). Grayson, Gardner and Stephens. Used with permission. For additional information, see community.jmp.com/docs/DOC-7562. 2 Credit Card Marketing Classification Trees Key ideas: Classification trees, validation, confusion matrix, misclassification, leaf report, ROC curves, lift curves.
  • 2. Background A bank would like to understand the demographics and other characteristics associated with whether a customer accepts a credit card offer. Observational data is somewhat limited for this kind of problem, in that often the company sees only those who respond to an offer. To get around this, the bank designs a focused marketing study, with 18,000 current bank customers. This focused approach allows the bank to know who does and does not respond to the offer, and to use existing demographic data that is already available on each customer. The designed approach also allows the bank to control for other potentially important factors so that the offer combination isn’t confused or confounded with the demographic factors. Because of the size of the data and the possibility that there are complex relationships between the response and the studied factors, a decision tree is used to find out if there is a smaller subset of factors that may be more important and that warrant further analysis and study. The Task We want to build a model that will provide insight into why some bank customers accept credit card offers. Because the response is categorical (either Yes or No) and we have a large number of potential predictor variables, we use the Partition platform to build a classification tree for Offer Accepted. We are primarily interested in understanding characteristics of customers who have accepted an offer, so the resulting model will be exploratory in nature.1
  • 3. The Data Credit Card Marketing BBM.jmp The data set consists of information on the 18,000 current bank customers in the study. Customer Number: A sequential number assigned to the customers (this column is hidden and excluded – this unique identifier will not be used directly). Offer Accepted: Did the customer accept (Yes) or reject (No) the offer. Reward: The type of reward program offered for the card. Mailer Type: Letter or postcard. Income Level: Low, Medium or High. # Bank Accounts Open: How many non-credit-card accounts are held by the customer. 1 In exploratory modeling, the goal is to understand the variables or characteristics that drive behaviors or particular outcomes. In predictive modeling, the goal is to accurately predict new observations and future behaviors, given the current information and situation. 3 Overdraft Protection: Does the customer have overdraft protection on their checking account(s)
is shown here. We encourage you to thoroughly understand your data and take the necessary steps to prepare your data for modeling before building exploratory or predictive models.

Exploring Data One Variable at a Time

Since we have a relatively large data set with many potential predictors, we start by creating numerical summaries of each of our variables using the Columns Viewer (see Exhibit 1). (Under the Cols menu, select Columns Viewer, then select all variables and click Show Summary. To deselect the variables, click Clear Select.)

Exhibit 1 Credit, Summary Statistics for All Variables With Columns Viewer
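For readers following along outside JMP, the same first-pass profiling can be sketched in Python with pandas. The file name below is a hypothetical CSV export of the JMP data table, not a file supplied with the case.

    # A minimal profiling sketch, assuming the JMP table has been exported to
    # "credit_card_marketing.csv" (hypothetical file name).
    import pandas as pd

    df = pd.read_csv("credit_card_marketing.csv")
    print(df.describe())                                 # centering, spread, shape
    print(df.select_dtypes(include="object").nunique())  # compare with N Categories
    print(df.isna().sum())                               # compare with N Missing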
Under N Categories, we see that each of our categorical variables has either two or three levels. N Missing indicates that we are missing 24 observations for each of the balance columns. (Further investigation indicates that these values are missing for the same 24 customers.) The other statistics provide an idea of the centering, spread and shapes of the continuous distributions.

Next, we graph our variables one at a time. (Select the variables within the Columns Viewer and click the Distribution button. Or, use Analyze > Distribution, select all of the variables as Y, Columns, and click OK. Click Stack from the top red triangle for a horizontal layout.) In Exhibit 2, we see that only around 5.68 percent of the 18,000 offers were accepted.

Exhibit 2 Credit, Distribution of Offer Accepted

We select the Yes level in Offer Accepted and then examine the distribution of accepted offers (the shaded area) across the other variables in our data set (the first 10 variables are shown in Exhibit 3).
Our two experimental variables are Reward and Mailer Type. Offers promoting Points and Air Miles are more frequently accepted than those promoting Cash Back, while Postcards are accepted more often than Letters. Offers also appear to be accepted at a higher rate by customers with low to medium income, no overdraft protection and low credit ratings.

Exhibit 3 Credit, Distribution of First 10 Variables

Note that both Credit Rating and Income Level are coded as Low, Medium and High. Peeking at the data table, we see that the modeling types for these variables are both nominal, but since the values are ordered categories they should be coded as ordinal variables. To change the modeling type, right-click on the modeling type icon in the data table or in any dialog window, and then select the correct modeling type.

Exploring Data Two Variables at a Time

We explore relationships between our response and potential predictor variables using Analyze > Fit Y by X (select Offer Accepted as Y, Response and the predictors as X, Factor, and click OK). For categorical predictors (nominal or ordinal), Fit Y by X conducts a contingency analysis (see Exhibit 4). For continuous predictors, Fit Y by X fits a logistic regression model. The first two analyses in Exhibit 4 show potential relationships between Offer Accepted and Reward (left) and Offer Accepted and Mailer Type (right). Note that tests for association between the categorical predictors and Offer Accepted are also provided by default (these are not shown in Exhibit 4), and additional statistical options and tests are provided under the top red triangles.

Exhibit 4 Credit, Fit Y by X, Offer Accepted versus Reward and Mailer Type
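The heart of this contingency analysis, the acceptance rate within each level of a factor, can likewise be sketched outside JMP, again using the hypothetical CSV export:

    # Acceptance rate by Reward and by Mailer Type, a rough stand-in for the
    # Fit Y by X contingency analysis (illustrative only).
    import pandas as pd

    df = pd.read_csv("credit_card_marketing.csv")
    accepted = (df["Offer Accepted"] == "Yes")
    for factor in ["Reward", "Mailer Type"]:
        print(accepted.groupby(df[factor]).mean())  # proportion accepted per level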
Although we haven’t thoroughly explored this data, thus far we’ve learned that:

• Only a small percentage – roughly 5.68 percent – of offers are accepted.
• We are missing some data for the Balance columns, but are not missing values for any other variable.
• Both of our experimental variables (Reward and Mailer Type) appear to be related to whether or not an offer is accepted.
• Two variables, Income Level and Credit Rating, should be coded as Ordinal instead of Nominal.
Again, we encourage you to thoroughly explore your data and to investigate and resolve potential data quality issues before building a model. Other tools, such as scatterplots and the Graph Builder, should also be used. While we have only superficially explored the data in this example (as we will also do in future examples), the primary purpose of this exploration is to get to know the variables used in this case study. As such, it is intentionally brief.

The Partition Model Dialog

Having a good sense of data quality and potential relationships, we now fit a partition model to the data using Analyze > Modeling > Partition, with Offer Accepted as Y, Response and all of the other variables as X, Factor (see Exhibit 5).

Exhibit 5 Credit, Partition Dialog Window

A Bit About Model Validation
When we build a statistical model, there is a risk that the model is overly complicated (overfit), or that the model will not perform well when applied to new data. Model validation (or cross-validation) is often used to protect against overfitting.2 There are two methods for model validation available from the Partition dialog window in JMP Pro: specify a Validation Portion or select a Validation column (note that other methods are available from within the platform).

In this example, we’ll use a random holdout portion (30 percent) to protect against overfitting (to do so, enter 0.30 in the Validation Portion field). This will assign 70 percent of the records to the training set, which is used to build the model. The remaining 30 percent will be assigned to the holdout validation set, which will be used to see how well the model performs on data not used to build the model.

2 For background information on model validation and protecting against overfitting, see en.wikipedia.org/wiki/Overfitting. For more information on validation in JMP and JMP Pro, see Building Better Models with JMP Pro, Chapters 6 and 8, or search for “validation” in the JMP Help.
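To illustrate the idea of a random holdout (not JMP's internal mechanics), a 70/30 split might be sketched in Python with scikit-learn:

    # Hold out a random 30 percent of rows for validation, a rough analogue of
    # entering 0.30 in the Validation Portion field (illustrative only).
    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("credit_card_marketing.csv")  # hypothetical export
    train, valid = train_test_split(df, test_size=0.30, random_state=123)
    print(len(train), len(valid))  # roughly 12,600 training and 5,400 validation rows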
Other Partition Model Dialog Options

Two additional options in the dialog window are Informative Missing and Ordinal Restricts Order. These are selected by default. In this example, we have two ordinal predictors, Credit Rating and Income Level. We also have missing values for the five balance columns. The Informative Missing option tells JMP to include rows with missing values in the model, and the Ordinal Restricts Order option tells JMP to respect the ordered categories for ordinal variables. For more information on these options, see the JMP Help. The completed Partition dialog window is displayed in Exhibit 5.

Building the Classification Tree

Initial results show the overall breakdown of Offer Accepted (Exhibit 6). Recall that roughly 5.7 percent of offers were accepted. Note: Since a random holdout is used, your results from this point forward may be different.3

Below the graph, we see that 12,610 observations are assigned to the training set. These observations will be used to build the model. The remaining 5,390 observations in the validation set will be used to check model performance and to stop tree growth.

3 To obtain the same results as shown here, use the Random Seed Reset add-in to set the random seed to 123 before launching the Partition platform. The add-in can be downloaded and installed from the JMP User Community: community.jmp.com/docs/DOC-6601.
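Continuing the holdout sketch above, a classification tree can be fit to the training rows with scikit-learn. This is only a loose analogue of the Partition platform: JMP's LogWorth splitting criterion, ordinal handling and Informative Missing option are not reproduced here, and max_leaf_nodes=16 simply reflects the fact that 15 binary splits produce at most 16 terminal nodes.

    # A loose Python analogue of fitting the partition model (sketch only).
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Encode once on the full table so training and validation share columns;
    # fillna(0) is a crude stand-in for JMP's Informative Missing option, and
    # the identifier column is excluded, as in the case study.
    X_all = pd.get_dummies(
        df.drop(columns=["Offer Accepted", "Customer Number"], errors="ignore")
          .fillna(0))
    y_all = (df["Offer Accepted"] == "Yes").astype(int)
    X_train, y_train = X_all.loc[train.index], y_all.loc[train.index]
    X_valid, y_valid = X_all.loc[valid.index], y_all.loc[valid.index]

    tree = DecisionTreeClassifier(max_leaf_nodes=16, random_state=123)
    tree.fit(X_train, y_train)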
Note that we have changed some of the default settings:

• Since we have a relatively large data set, points were removed from the graph (click on the top red triangle and select Display Options > Show Points).
• The response rates and counts are displayed in the tree nodes (select Display Options > Show Split Count from the top red triangle).

Exhibit 6 Credit, Partition Initial Window

How Classification Trees Are Formed

When building a classification tree, JMP iteratively splits the data based on the values of predictors to form subsets. These subsets form the “branches” of the tree. Each split is made at the predictor value that causes the greatest difference in proportions (for the outcome variable) in the resulting subsets. A measure of the dissimilarity in proportions between the two subsets is the likelihood ratio chi-square statistic and its associated p-value. The lower the p-value, the greater the difference between the groups.
When JMP calculates this chi-square statistic in the Partition platform, it is labeled G^2, and the p-value that is calculated is adjusted to account for the number of splits that are being considered. The adjusted p-value is transformed to a log scale using the formula

LogWorth = -log10(adjusted p-value)

The bigger the LogWorth value, the better the split (Sall, 2002). To find the split with the largest difference between subgroups (and the corresponding largest value of LogWorth), we need to consider all possible splits. For each variable, the best split location, or cut point, is determined, and the split with the highest LogWorth is chosen as the optimal split location.
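As a small numerical illustration of this transformation (without JMP's multiplicity adjustment of the p-value), the calculation might be sketched as:

    # LogWorth from a likelihood ratio chi-square statistic (sketch only; the
    # p-value adjustment JMP applies is not reproduced here).
    import numpy as np
    from scipy.stats import chi2

    def logworth(g2, dof=1):
        p_value = chi2.sf(g2, dof)  # upper-tail probability of the chi-square
        return -np.log10(p_value)

    print(logworth(50.0))  # larger G^2 -> smaller p-value -> larger LogWorth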
JMP reports the G^2 and LogWorth values, along with the best cut points for each variable, under Candidates (use the gray disclosure icon next to Candidates to display them). A peek at the candidates in our example indicates that the first split will be on Credit Rating, with Low in one branch and High and Medium in the other (Exhibit 7).

Exhibit 7 Credit, Partition Initial Candidate Splits

The tree after three splits (click Split three times) is shown in Exhibit 8. Not surprisingly, the model splits on Credit Rating, Reward and Mailer Type. The lowest probability of accepting the offer (0.0196) corresponds to Credit Rating(Medium, High) and Reward(Cash Back, Points). The highest probability (0.1473) corresponds to Credit Rating(Low) and Mailer Type(Postcard).

After each split, the model RSquare (or Entropy RSquare) updates (this is shown at the top of Exhibit 8). RSquare is a measure of how much variability in the response is being explained by the model. Without a validation set, we could continue to split until the minimum split size is reached in each branch. (The minimum split size is an option under the top red triangle, set to 5 by default.) However, additional splits are not necessarily beneficial, and they lead to more complex and potentially overfit models.

Exhibit 8 Credit, Partition after Three Splits

Since we have a validation set, we click Go to automate the tree-building process. When this option is used, the final model is the one with the maximum value of the Validation RSquare statistic. The Split History report (Exhibit 9) shows how the RSquare value changes for the training and validation data after each split (note that the y-axis has been rescaled for illustration). The vertical line is drawn at 15, the number of splits used in the final model.

Exhibit 9 Split History, with Maximum Validation RSquare at Fifteen Splits
This illustrates both the concept of overfitting and the importance of using validation. With each split, the RSquare for the training data continues to increase. However, after 15 splits, the validation RSquare (the lower line in Exhibit 9) starts to decrease. For the validation set, which was not used to build the model, additional splits do not improve our ability to predict the response.

Understanding the Model

To summarize which variables are involved in these 15 splits, we turn on Column Contributions (from the top red triangle). This table indicates which variables are most important in terms of their overall contribution to the model (see Exhibit 10). Credit Rating, Mailer Type, Reward and Income Level contribute most to the model. Several variables, including the five balance variables, are not involved in any of the splits.

Exhibit 10 Credit, Column Contributions after Fifteen Splits
Model Classifications and the Confusion Matrix

One overall measure of model accuracy is the misclassification rate (select Show Fit Details from the top red triangle). The misclassification rate for our validation data is 0.0573, or 5.73 percent. The numbers behind the misclassification rate can be seen in the confusion matrix (bottom of Exhibit 11). Here, we focus on the misclassification rate and confusion matrix for the validation data. Since these data were not used in building the model, they provide a better indication of how well the model classifies our response, Offer Accepted. There are four possible outcomes in our classification:

• An accepted offer is correctly classified as accepted.
• An accepted offer is misclassified as not accepted.
• An offer that was not accepted is correctly classified as not accepted.
• An offer that was not accepted is misclassified as accepted.

One observation is that there were few cases in which the model predicted that the offer would be accepted (see the value “2” in the Yes column of the validation confusion matrix in Exhibit 11). When the target variable is unbalanced (i.e., there are far more observations in one level than in the other), the fitted model will usually produce small probabilities for the underrepresented category.
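Continuing the earlier scikit-learn sketch, the validation confusion matrix and misclassification rate might be computed as follows (illustrative only; the counts will not match Exhibit 11 exactly):

    # Validation confusion matrix and misclassification rate, a rough analogue
    # of Show Fit Details (continues the hypothetical tree sketched earlier).
    from sklearn.metrics import confusion_matrix

    pred = tree.predict(X_valid)            # default 0.5 probability cutoff
    print(confusion_matrix(y_valid, pred))  # rows: actual; columns: predicted
    print((pred != y_valid).mean())         # misclassification rate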
Exhibit 11 Credit, Fit Details With Confusion Matrix

In this case, the overall rate of Yes (i.e., offer accepted) is 5.68 percent, which is close to the misclassification rate for this model. However, when we examine the Leaf Report for the fitted model (Exhibit 12), we see that there are branches in the tree with much richer concentrations of Offer Accepted = Yes than the overall average rate. (Note that results in the Leaf Report are for the training data.)

Exhibit 12 Credit, Leaf Report for Fitted Model

The model has probabilities of Offer Accepted = Yes in the range [0.0044, 0.6738]. When JMP classifies rows with the model, it uses a default of Prob > 0.5 to make the decision. In this case, only one of the predicted probabilities of Yes exceeds 0.5, and that one branch (or node) has only 11 observations: 8 Yes and 3 No under Response Counts in the bottom table in Exhibit 12.
The next highest predicted probability of Offer Accepted = Yes is 0.2056. As a result, all other rows are classified as Offer Accepted = No.

The ROC Curve

Two additional measures of accuracy used when building classification models are sensitivity and 1-specificity. Sensitivity is the true positive rate; in our example, this is the ability of our model to correctly classify Offer Accepted as Yes. The second measure, 1-specificity, is the false positive rate; here, a false positive occurs when an offer was not accepted but was classified as Yes (accepted).

Instead of using the default decision rule of Prob > 0.5, we examine the decision rule Prob > T, where we let the decision threshold T range from 0 to 1. We plot sensitivity (on the y-axis) versus 1-specificity (on the x-axis) for each possible threshold value. This creates a Receiver Operating Characteristic (ROC) curve. The ROC curve for our model is displayed in Exhibit 13 (this is a top red triangle option in the Partition report).

Exhibit 13 Credit, ROC Curve for Offer Accepted

Conceptually, what the ROC curve measures is the ability of the predicted probability formula to rank observations. Here, we simply focus on the Yes outcome for the Offer Accepted response variable. We save the probability formula to the data table and then sort the table from highest to lowest probability. If the probability model can correctly classify the outcomes for Offer Accepted, we would expect to see more Yes response values at the top of the sorted table (where the probability for Yes is highest) than No responses. Similarly, at the bottom of the sorted table, we would expect to see more No than Yes response values.
Constructing an ROC Curve

What follows is a practical algorithm for quickly drawing an ROC curve once the table has been sorted by the predicted probability; a code sketch of the algorithm appears after the list. Here, we walk through the algorithm for Offer Accepted = Yes, but this is done automatically in JMP for each response category. For each observation in the sorted table, starting at the observation with the highest probability of Offer Accepted = Yes:

• If the observed response value is Yes, a vertical line segment (increasing along the sensitivity axis) is drawn. The length of the segment is 1/(total number of Yes responses in the table).
• If the observed response value is No, a horizontal line segment (increasing along the 1-specificity axis) is drawn. The length of the segment is 1/(total number of No responses in the table).
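A direct translation of this algorithm into Python might look like the sketch below; the input list is assumed to hold the actual responses, already sorted by descending predicted probability of Yes. (Applying it to the eight sorted outcomes of the first example in the next section traces the left-hand curve of Exhibit 14.)

    # Build the ROC curve point by point from sorted actual responses.
    def roc_points(y_sorted):
        n_yes = sum(1 for label in y_sorted if label == "Yes")
        n_no = len(y_sorted) - n_yes
        x, y = 0.0, 0.0
        points = [(x, y)]
        for label in y_sorted:
            if label == "Yes":
                y += 1.0 / n_yes  # vertical step along the sensitivity axis
            else:
                x += 1.0 / n_no   # horizontal step along the 1-specificity axis
            points.append((x, y))
        return points

    print(roc_points(["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes"]))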
Simple ROC Curve Examples

We use a simple example to illustrate. Suppose we have a data table with only 8 observations, sorted from high to low based on the probability that Outcome = Yes. The sorted actual response values are Yes, Yes, Yes, No, Yes, Yes, No and Yes. This results in the ROC curve on the left of Exhibit 14. Arrows have been added to show the steps in the ROC curve construction. The first three line segments are drawn up because the first three sorted values have Outcome = Yes.

Now suppose that we have a different probability model that we use to rank the observations, resulting in the sorted outcomes Yes, No, Yes, No, No, Yes, No and Yes. The ROC curve for this situation is shown on the right of Exhibit 14. The first ROC curve moves “up” faster than the second curve, an indication that the first model does a better job of separating the Yes responses from the No responses based on the predicted probability.

Exhibit 14 ROC Curve Examples

Referring back to the sample ROC curve in Exhibit 13, we see that JMP has also displayed a diagonal reference line on the chart, which represents the sensitivity = 1-specificity line. If a probability model cannot sort the data into the correct response category, it may be no better than sorting at random; the ROC curve for such a “random ranking” model would be similar to this diagonal line. A model that sorts the data perfectly, with all the Yes responses at the top of the sorted table, would have an ROC curve that goes from the origin straight up to sensitivity = 1, then straight over to 1-specificity = 1. A model that sorts perfectly can be made into a classifier rule that classifies perfectly; that is, a rule with a sensitivity of 1.0 and a 1-specificity of 0.0.
The area under the curve, or AUC (labeled Area in Exhibit 13), is a measure of how well our model sorts the data. The diagonal line, which represents a random sorting model, has an AUC of 0.5; a perfect sorting model has an AUC of 1.0. The area under the curve for Offer Accepted = Yes is 0.7369 (see Exhibit 13), indicating that the model predicts better than the random sorting model.

The Lift Curve

Another measure of how well a model can sort outcomes is the model lift. As with the ROC curve, we examine the table sorted in descending order of predicted probability. For each sorted row, we calculate the sensitivity and divide it by the proportion of rows in the table for which Offer Accepted = Yes. This value is the model lift. Lift is a measure of how much “richness” in the response we achieve by applying a classification rule to the data.
A Lift Curve plots the lift (on the y-axis) against the portion (on the x-axis). Again, consider the data table sorted by the predicted probability of a given outcome. As we go down the table from top to bottom, portion is the relative position of the row we are considering: the top 10 percent of rows in the sorted table corresponds to a portion of 0.1, the top 20 percent of rows to a portion of 0.2, and so on. The lift for Offer Accepted = Yes at a given portion is simply the proportion of Yes responses in that portion divided by the overall proportion of Yes responses in the entire data table. The higher the lift at a given portion, the better our model is at correctly classifying the outcome within that portion.

For Offer Accepted = Yes, the lift at portion = 0.15 is roughly 2.5 (see Exhibit 15). This means that among the rows corresponding to the top 15 percent of the model’s predicted probabilities, the number of actual Yes outcomes is 2.5 times higher than we would expect if we had chosen 15 percent of the rows at random. If the model does not sort the data well, the lift will hover around 1.0 across all portion values.

Exhibit 15 Lift Curve for Offer Accepted

Lift provides another measure of how good our model is at classifying outcomes; it is particularly useful when the overall predicted probabilities are lower than 0.5 for the outcome that we wish to predict.
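Under the same assumptions as the ROC sketch above (actual responses sorted by descending predicted probability), lift at a given portion can be computed directly:

    # Lift = proportion of Yes in the top rows / overall proportion of Yes.
    def lift_at_portion(y_sorted, portion, positive="Yes"):
        k = max(1, int(round(portion * len(y_sorted))))
        top_rate = sum(1 for y in y_sorted[:k] if y == positive) / k
        overall = sum(1 for y in y_sorted if y == positive) / len(y_sorted)
        return top_rate / overall

    # For example, the lift at the top 15 percent of sorted rows:
    # lift_at_portion(sorted_responses, 0.15)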
Though in this example the majority of the predicted probabilities of Offer Accepted = Yes were less than 0.2, the lift curve indicates that there are threshold values we could use with the predicted probability model to create a classifier rule that is better than guessing at random. This rule can be used to identify portions of our data that contain a much richer concentration of customers who are likely to accept an offer.

For categorical response models, the misclassification rate, confusion matrix, ROC curve and lift curve all provide measures of model accuracy; each of these should be used to assess the quality of the prediction model.

Summary

Statistical Insights

In this case study, a classification tree was used to predict the probability of an outcome based on a set of predictor variables using the Partition platform. If the response variable is continuous rather than categorical, a regression tree can be used to predict the mean of the response. Construction of regression trees is analogous to construction of classification trees; however, splits are based on the mean response value rather than on the probabilities of the outcome categories.
Implications

This model was created for explanatory rather than predictive purposes. Our goal was to understand the characteristics of customers most likely to accept a credit card offer. In a predictive model, we are more interested in creating a model that accurately predicts the response (i.e., predicts future customer behavior) than in identifying important variables or characteristics.

JMP Features and Hints

In this case study we used the Columns Viewer and the Distribution platform to explore variables one at a time, and Fit Y by X to explore the relationships between our response (or target) variable and the predictor variables. This exploratory work was only partially completed here. We used the misclassification rate and confusion matrix as overall measures of model accuracy, and introduced the ROC and lift curves as additional measures of accuracy. As discussed, ROC and lift curves are particularly useful in cases where the probability of the target response category is low.

Note: The cutoff for classification used throughout JMP is 0.50. In some modeling situations it may be desirable to change the classification cutoff (say, when the probability of response is extremely low). This can be achieved manually by saving the prediction formula to the data table and
then creating a new formula column that classifies the outcome based on a specified cutoff. In a formula column of this kind, JMP would classify an outcome as “Yes” whenever the predicted probability of acceptance is at least the chosen cutoff (0.30, say). The Tabulate platform (under Analyze) can then be used to manually create a confusion matrix. An add-in from the JMP User Community can also be used to change the classification cutoff (community.jmp.com/docs/DOC-6901). This add-in allows the user to enter a range of values for the cutoff and produces confusion matrices for each cutoff value. The goal is to find a cutoff that minimizes the misclassification rate on the validation set.
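A Python stand-in for such a formula column, continuing the earlier scikit-learn sketch, might be:

    # Reclassify with a 0.30 cutoff instead of the default 0.50 (illustrative;
    # the case study uses a JMP formula column, not Python).
    prob_yes = tree.predict_proba(X_valid)[:, 1]  # predicted probability of Yes
    pred_030 = (prob_yes >= 0.30).astype(int)     # classify Yes at prob >= 0.30
    print((pred_030 != y_valid).mean())           # misclassification at this cutoff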
Exercises

Exercise 1: Use the Credit Card Marketing BBM.jmp data set to answer the following questions:

a. The Column Contributions output after 15 splits is shown in Exhibit 10. Interpret this output. How can this information be used by the company? What is the potential value of identifying these characteristics?
b. Recreate the output shown in Exhibits 9-11, but instead use the Split button to split manually. Create a classification tree with 25 splits.
c. How did the Column Contributions report change?
d. How did the misclassification rate and confusion matrix change? How did the ROC and lift curves change? Did these additional splits provide any additional (useful) information?
e. Why is this an exploratory model rather than a predictive model? Describe the difference between exploratory and predictive models.

Exercise 2: Use the Titanic Passengers.jmp data set in the JMP Sample Data Library (under the Help menu) for this exercise. This data table describes the survival status of 1,309 of the 1,324 individual passengers on the Titanic. Information on the 899 crew members is not included. Some of the variables are described below:

Name: Passenger name
Survived: Yes or No
Passenger Class: 1, 2 or 3, corresponding to 1st, 2nd or 3rd class
Sex: Passenger sex
Age: Passenger age
Siblings and Spouses: The number of siblings and spouses aboard
Parents and Children: The number of parents and children aboard
Fare: The passenger fare
Port: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Home/Destination: The home or final intended destination of the passenger
Build a classification tree for Survived by determining which variables to include as predictors. Do not use model validation for this exercise. Use Column Contributions and Split History to determine the optimal number of splits.

a. Which variables, if any, did you choose not to include in the model? Why?
b. How many splits are in your final tree?
c. Which variables are the largest contributors?
d. What is your final model? Save the prediction formula for this model to the data table (we will refer to it in the next exercise).
e. What is the misclassification rate for this model? Is the model better at predicting survival or non-survival? Explain.
f. What is the area under the ROC curve for Survived? Interpret this value. Does the model do a better job of classifying survival than a random model?
g. What is the lift for the model at portion = 0.1 and at portion = 0.25? Interpret these values.

Exercise 3: Use the Titanic Passengers.jmp data set for this exercise. Use the Fit Model platform to create a logistic regression model for Survived, using the other variables as predictors. Include interaction terms you think might be meaningful or significant in predicting the probability of survival. For information on fitting logistic regression models, see the guide and video at community.jmp.com/docs/DOC-6794.

a. Which variables are significant in predicting the probability of survival? Are any of the interaction terms significant?
b. What is the misclassification rate for your final logistic model?
c. Compare the misclassification rates for the logistic model and the partition model created in Exercise 2. Which model is better? Why?
d. Compare this model to the model produced using a classification tree. Which model would be easier to explain to a non-technical person? Why?
Case 2 - Baggage Complaints: Descriptive Statistics and Time Series Plots

Marlene Smith, University of Colorado Denver Business School

Background

Anyone who travels by air knows that occasional problems are inevitable.
Flights can be delayed or cancelled due to weather conditions, mechanical problems, or labor strikes, and baggage can be lost, delayed, damaged, or pilfered. Given that many airlines are now charging for bags, issues with baggage are particularly annoying. Baggage problems can have a serious impact on customer loyalty, and can be costly to the airlines (airlines often have to deliver bags).

Air carriers report flight delays, cancellations, overbookings, late arrivals, baggage complaints, and other operating statistics to the U.S. government, which compiles the data and reports it to the public.

The Task

Do some airlines do a better job of handling baggage? Compare the baggage complaints for three airlines: American Eagle, Hawaiian, and United. Which airline has the best record? The worst? Are complaints getting better or worse over time? Are there other factors, such as destinations, seasonal effects or the volume of travelers, that affect baggage performance?

The Data Baggage Complaints.jmp

The data set contains monthly observations from 2004 to 2010 for United Airlines, American Eagle, and Hawaiian Airlines. The variables in the data set include:

Baggage: The total number of passenger complaints for theft of baggage contents, or for lost, damaged, or misrouted luggage, for the airline that month
Scheduled: The total number of flights scheduled by that airline that month
Cancelled: The total number of flights cancelled by that airline that month
Enplaned: The total number of passengers who boarded a plane with the airline that month

These data are available from the U.S. Department of Transportation, “Air Travel Consumer Report,” Office of Aviation Enforcement and Proceedings, Aviation Consumer Protection Division (http://airconsumer.dot.gov/reports/index.htm). The data for baggage complaints and enplaned passengers cover domestic travel only.

Analysis

We start by exploring baggage complaints over time. Exhibit 1 shows the time series plot of Baggage by Date for each of the airlines. United Airlines has the most complaints about mishandled baggage in almost all of the months in the data set; Hawaiian Airlines has the fewest complaints in all months. Do we conclude, then, that United Airlines has the “worst record” for mishandled baggage and Hawaiian the best?
Exhibit 1 Time Series Plots of Baggage by Airline

(Graph > Graph Builder; drag and drop Baggage in Y, Date in X and Airline in Overlay. Click on the smoother icon at the top to remove the smoother, and click on the line icon. Or, right-click in the graph and select Smoother > Remove, and Points > Change to > Line. Then, click Done.)

United Airlines is a much bigger airline, as evidenced by Exhibit 2, which shows the average number of scheduled flights and enplaned passengers by airline. United handles more than three times as many passengers as American Eagle on average, and almost eight times as many as Hawaiian. Thus, United has more opportunities to mishandle luggage because it handles more luggage – it is simply a much bigger airline.

Exhibit 2 Average Scheduled Flights and Enplaned Passengers by Airline

(Analyze > Tabulate; drag Airline into the drop zone for rows, and Scheduled and Enplaned into the drop zone for columns as analysis columns. Then, drag Mean from the middle panel to the middle of the table. Note that in JMP versions 10 and earlier, Tabulate is under the Tables menu.)
To adjust for size, we calculate the rate of baggage complaints (Exhibit 3):

Baggage % = 100 × (Baggage / Enplaned)

Exhibit 3 Calculating Baggage %

(Create a new column in the data table and rename it Baggage %. Right-click on the column header and select Formula to open the Formula Editor. To create the formula: 1. Type 100. 2. Select multiply on the keypad. 3. Select Baggage from the columns list. 4. Select divide on the keypad. 5. Select Enplaned from the columns list. 6. Click OK.)
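Outside JMP, the same column and comparison can be sketched in pandas ("baggage_complaints.csv" is a hypothetical export of the JMP table):

    # Compute Baggage % and compare airlines, a stand-in for the JMP formula
    # column and the Tabulate summaries.
    import pandas as pd

    df = pd.read_csv("baggage_complaints.csv")
    df["Baggage %"] = 100 * df["Baggage"] / df["Enplaned"]
    print(df.groupby("Airline")["Baggage %"].mean())  # compare with Exhibit 4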
In Exhibit 4 we compare the records of the three airlines using Baggage %. We see that American Eagle has the highest rate of baggage complaints when adjusted for the number of enplaned passengers.

Exhibit 4 Average Baggage % by Airline

Plotting Baggage % on a time series plot allows us to see changes over time. In Exhibit 5 we see that baggage complaint rates for American Eagle and United passengers increased through 2006 and began declining thereafter.

Exhibit 5 Time Series Plots of Baggage % by Airline

The time series for Hawaiian passengers in Exhibit 5 is relatively flat compared to American Eagle and United, so it is difficult to detect a pattern over time. In Exhibit 6 we isolate the data for the Hawaiian flights. Complaint rates for Hawaiian passengers began to drop in the summer of 2008 and stayed lower until the fall of 2010, after which the rate of complaints returned to historical levels.

Exhibit 6 Data Filter and Time Series Plot of Baggage % - Hawaiian Airline

(Rows > Data Filter; select Airline and click Add. Then, select Hawaiian and check the Show and Include boxes.)
Let’s return to Exhibit 5: Do we see any other patterns in baggage complaint rates over time? The pattern of spikes and dips indicates that changes in the rate of baggage complaints may have a seasonal component. In Exhibit 7 we plot the average Baggage % by month for the three airlines to investigate this further. Average rates are highest in December and January and in the summer months, and lowest in the spring and late fall. Interestingly, all three airlines seem to follow the same general pattern. (Using the Data Filter to zoom in on Hawaiian will make this more apparent.)

Exhibit 7 Time Series Plots of Monthly Average Baggage % by Airline

(Graph > Graph Builder; drag and drop Baggage % in Y, Month in X and Airline in Group X. Then, click on the line icon at the top. Or, right-click in the graph and select Points > Change to > Line. Then, click Done.)
Summary

Statistical Insights

It is important to ask a lot of pointed questions about the purpose and scope of a project before jumping into data analysis. How will the results be used? And is the information available to answer the questions being asked? The data set provided did not have information on flight destinations, so we are not able to investigate whether destination is related to baggage issues. To provide a fair comparison of performance across airlines, it's best to standardize for differences in volume. It is also important to pay close attention to units of measurement, as will be seen in an exercise.

Managerial Implications

When asking a data analyst to conduct a study, ensure that the purpose is clear. What, specifically, do you want to know? And how will the information provided by the analyst be used to guide the decision-making process? What did we learn from the analysis? Managers at American Eagle should note the relatively high rate of baggage complaints and consider the costs and benefits of improvements. All three airlines
should anticipate high rates of baggage complaints in December, January and the summer months, and should plan accordingly. If we are interested in studying baggage complaints for different destinations, additional data are required.

JMP Features and Hints

This case uses Graph Builder to compare multiple time series and Tabulate to produce summary statistics by group. The Data Filter is used with Graph Builder to further investigate a particular group.

Exercises

1. Exhibit 4 shows that the average Baggage % for American Eagle is 1.033. Provide a nontechnical interpretation of this number by using it in this sentence: "1.033 is the…" Pay particular attention to units of measurement.
2. Use Tabulate to calculate the standard deviation of Baggage % for each airline. Provide a nontechnical interpretation of the standard deviations. Is one airline more variable than the others?
3. Compare the three airlines based on cancelled flights: Which
airline has the best record? The worst?

Case 8 - Contributions: Simple Linear Regression and Time Series
Marlene Smith, University of Colorado Denver Business School

Contributions1: Simple Linear Regression and Time Series

Background

The Colorado Combined Campaign solicits Colorado government employees' participation in a fund-raising drive. Funds raised by the campaign go to over 700 Colorado charities in all, including the Humane Society of Boulder Valley and the Denver Children's Advocacy Center. Prominent state employees, such as university presidents, chancellors and lieutenant governors, head the annual campaigns. An advisory committee determines whether the charities receiving contributions provide the services claimed in a fiscally responsible manner.

All Colorado state employees may contribute to the fund. However, certain state institutions are targeted to receive promotional brochures and campaign literature. Employees in these targeted groups are
referred to as "eligible" employees. Each year, the number of eligible employees is known in June. Fund-raising activities are then conducted throughout the fall. By year's end, total contributions raised that year are tabulated.

The Task

It is now June 2010. The number of eligible employees for 2010 has been determined to be 53,455. Does knowing the number of eligible employees help predict 2010 year-end contributions?

The Data Contributions.jmp

This is an annual time series from 1988 – 2009. The variables are the contribution Year and:

Actual: Total contributions to the campaign for the year, in dollars.
Employees: Number of eligible employees that year.

Analysis

The average level of contributions during this time period was $1,143,769, with a typical fluctuation of $339,788 around the average. The average number of eligible employees was 45,419, with a typical fluctuation of 9,791.

Exhibit 1 Summary Statistics for Actual and Employees

(Analyze > Tabulate; drag Actual and Employees to the drop zone for rows as analysis columns. Then, drag Mean and Std Dev from the middle panel to the drop zone for columns. Note that in JMP versions 10 and earlier, Tabulate is under the Tables menu.)
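As with the baggage case, these summaries are easy to verify outside JMP. A sketch assuming a hypothetical Contributions.csv export with the columns Actual and Employees:

    import pandas as pd

    df = pd.read_csv("Contributions.csv")  # hypothetical export of Contributions.jmp

    # Mean and standard deviation of contributions and eligible employees (Exhibit 1).
    print(df[["Actual", "Employees"]].agg(["mean", "std"]))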
1 Mel Rael, Executive Director of the Colorado Combined Campaign, graciously provided these data.

As we can see in Exhibit 2, contributions are growing over time.

Exhibit 2 Time Series Plot of Actual by Year

(Graph > Graph Builder; drag and drop Actual in Y and Year in X. Click on the smoother icon at the top to remove the smoother. Hold the shift key and click the line icon to add a line. Or, right-click in the graph to select these options. Then, click Done.)

The long-term growth in contributions is attributable to two phenomena:
• The amount contributed per eligible employee trends mostly upward (Exhibit 3, top).
• The number of eligible employees is on the rise, particularly in the 1999 to 2002 campaign years (Exhibit 3, bottom).
Exhibit 3 Time Series Plots of Actual per Employee and Employees

(Create a new column and rename it Actual per Employee, then use the Formula Editor to create the formula – Actual divided by Employees. Follow the instructions for Exhibit 2 to create the graph for Employees. Then, click and drag Actual per Employee above Employees in Y, and release. To change the markers for points, use the lasso from the toolbar to select the points (draw a circle around them). Then go to Rows > Markers and select a marker.)

The scatterplot and least squares regression line using Actual as the response variable and Employees as the predictor variable are shown in Exhibit 4. The formula for the regression line is found below the plot
under Linear Fit. The slope of the fitted line, 33.555, estimates the contribution for each additional eligible employee over this time period. Hence, the model estimates an additional $33.56 in contributions for each eligible employee. Under Parameter Estimates, we see that the number of employees is a statistically significant predictor of year-end contributions; the p-value, listed as Prob > |t|, is < 0.0001.

The number of employees doesn't perfectly predict contributions. Just over 93% of the variability in contributions is associated with variability in the number of eligible employees (RSquare = 0.934907). Comparing the standard deviation of Actual ($339,788) to the root mean square error of the regression equation (RMSE = $88,832) suggests that using the regression model to explain variation in year-end contributions substantially reduces the unexplained variation.

Exhibit 4 Regression with Actual (Y) and Employees (X)

(Analyze > Fit Y by X. Use Actual as Y, Response and Employees as X, Factor. Under the red triangle, select Fit Line. Note: To remove the markers in Exhibit 3, go to the Rows menu and select Clear Row States.)
We've been informed that the number of eligible employees in 2010 is 53,455. To forecast 2010 year-end contributions, we can plug this number into the regression equation. If the number of Employees is 53,455, the predicted Actual contributions are:

Actual = -380265.5 + (33.555042 × Employees)
       = -380265.5 + (33.555042 × 53,455)
       = 1413419.3 (or $1,413,419)

In words, given that the number of eligible employees is 53,455, our model estimates that 2010 year-end contributions will be approximately $1.413 million.

Easier still, we can skip the math exercise, save the regression formula and prediction intervals, and ask JMP to calculate the estimated contributions for 2010 (Exhibit 5). Prediction intervals are useful, since the number of employees isn't a perfect predictor of contributions. The prediction interval gives us an estimate of the interval in which the 2010 year-end contributions will fall (with 95% confidence).
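For readers working outside JMP, here is a hedged sketch of the same fit and forecast using statsmodels. The CSV file name is hypothetical, and the printed values should match the case's results only up to rounding:

    # A minimal sketch of the case's simple regression, assuming a hypothetical
    # Contributions.csv export with columns Actual and Employees.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("Contributions.csv")

    X = sm.add_constant(df["Employees"])
    model = sm.OLS(df["Actual"], X).fit()

    print(model.params)             # intercept ~ -380265.5, slope ~ 33.555
    print(model.rsquared)           # ~ 0.9349
    print(model.mse_resid ** 0.5)   # RMSE ~ 88,832

    # Forecast 2010 contributions for 53,455 eligible employees, with the
    # 95% prediction interval JMP saves as "Indiv Confidence Intervals".
    # has_constant="add" forces the intercept column for a single new row.
    new = sm.add_constant(pd.DataFrame({"Employees": [53455]}), has_constant="add")
    pred = model.get_prediction(new).summary_frame(alpha=0.05)
    print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])  # mean ~ $1,413,419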
Exhibit 5 Predicted Value and Prediction Interval for 2010 Contribution

(In the Bivariate Fit window, select Save Predicteds under the red triangle for Linear Fit. JMP will create a new column with the prediction formula for Actual. Create a new row and enter a value for Employees – the predicted value for Actual will display. To save prediction intervals, use Analyze > Fit Model; select Actual as Y and Employees as a model effect, and hit Run. Under the red triangle, select Save Columns > Indiv Confidence Intervals.)

Predicted values can also be explored dynamically using the cross-hair tool. In Exhibit 6, we see that the predicted value for Actual, if Employees is 53,414, is around $1.402 million.

Exhibit 6 Using Cross-hair Tool to Explore Predicted Contribution

(Select the cross-hair tool on the toolbar. Click on the regression line at the value of the predictor to see the predicted response value.)
We can also graphically explore prediction intervals (Exhibit 7).

Exhibit 7 Prediction Intervals for Actual

(In the Bivariate Fit window, select Confid Curves Indiv under the red triangle next to Linear Fit. Use the cross-hairs to find the upper and lower bounds for the prediction interval.)

Summary

Statistical Insights

Forecasting using regression involves substituting known or hypothetical values for X into the regression equation and solving for Y. In this case, values for the predictor variable in the forecasting horizon are known in advance; i.e., we know that the 2010 value for Employees is 53,455, so we plugged this value into the regression equation to forecast year-end contributions. In another setting, in which the same-year value for X is unknown, how would we proceed? One possibility is to forecast the value of the predictor variable. Another possibility, when theoretically and statistically justified, is to use lagged values of the original predictor variables in the regression model, as in the sketch below. When building any regression model, residuals should be checked to ensure that the linear fit makes sense.
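As a hedged illustration of that second possibility (it is not part of the case's analysis), a one-year lag can be built with pandas and fit the same way; the lag length here is purely illustrative:

    # Illustrative only: regress this year's contributions on last year's
    # employee count, for settings where the same-year value is unknown.
    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("Contributions.csv").sort_values("Year")  # hypothetical export
    df["Employees_lag1"] = df["Employees"].shift(1)            # previous year's count
    lagged = df.dropna(subset=["Employees_lag1"])              # first year has no lag

    lag_model = sm.OLS(lagged["Actual"],
                       sm.add_constant(lagged["Employees_lag1"])).fit()
    print(lag_model.params)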
Managerial Implications

Regression has provided a prediction for year-end 2010 Colorado Combined Campaign contributions of $1.4M. In managerial settings such as this, where the response variable represents a business goal, managers often set higher expectations than the predicted value to motivate improved performance. One such choice here might be the upper 95% prediction limit of $1.6M. This forecasting methodology can be repeated year after year. Once the final contributions for 2010 are known, they can be added to the data set and the regression line can be recalculated. By midyear of 2011, the number of eligible employees will be known.
Note that, in this case, we focused on a simple regression using only Employees as the predictor. We could also fit a trend model using Year, or a model with both Employees and Year. We will consider regression models with more than one predictor in a future case.

JMP Features and Hints

In this case we used Fit Y by X to develop a regression model. We used the cross-hair tool to explore the predicted value of the response at a given value of the predictor. Several options, such as saving predicted values and showing prediction intervals, are available under the red triangle for the fitted line. When the prediction formula is saved, a new column with the regression formula is created. Enter the value of X in a new row in the JMP data table, and the predicted value will display. To save prediction intervals to the data table for the value of X, use Fit Model. Note that other intervals and model diagnostics are also available from both Fit Y by X and Fit Model. To generate residual plots from within Fit Y by X, select the option under the red triangle next to Linear Fit.

Exercises

A regression trend analysis uses only the information contained in the passage of time to predict a response variable.

1. Perform a trend analysis with the Colorado Combined
Campaign data, using Actual as the response variable and Year as the predictor.
2. Forecast the 2010 – 2013 Colorado Combined Campaign contributions.
3. Compare your forecast for 2010 with that obtained from the simple linear regression model in which the number of eligible employees is the predictor variable. Hint: Compare RMSE, RSquare, and the estimated contributions for 2010. Which model does a better job of explaining variation in contributions?
4. We've limited our analyses to one predictor variable at a time. Guesstimate what would happen, in terms of RMSE, RSquare and model predictions, if we were to build a model with both Year and Employees.