Building a Regression Model using SPSS

Zac Bodner
Final Lab Assignment

In class this semester, we have already explored regression for explanatory purposes. For
example, we previously built a regression model to explain what effects certain independent
variables (trustworthiness, intelligence, “like me-ness”) have on certain dependent variables (I
have a good opinion of) in political candidates. These models were helpful in determining how
different candidates could employ certain tactics to gain favor (and hopefully votes) from voters.
In this regard, regression models are effective tools in explaining the reasons behind certain
phenomena - like why certain candidates do well with certain populations, or what factors make
us love ice cream - but their real power comes in how capable these models are of predicting
certain outcomes. For example, if we have an explanatory regression model that identifies
certain characteristics (variables) shared by both customers and non-customers in a particular
data set - could we then turn around and use that model to identify additional, probable
customers (and non-customers) in another, independent data set (and make tons of cash for our
employers in the process)?
We can and we will!
The following paper presents a step-by-step explanation of how to do this, based on our final lab
assignment of the semester.
The first thing we must do is divide the current data set (Customer, which we have been using all
semester) in half. This will allow us to build our model on one half of the data set and confirm its
findings on the other. In other words, this will help us show that our model will work not only on
the data set that we are testing, but on other, unrelated (random) data sets as well.
The key to dividing a data set into equal, random halves is to confirm that both halves are
distributed evenly on a number of variables. We have already divided the data set using SPSS
in a previous assignment, and confirmed the randomness and equality of both halves by
examining the distributions of some of these variables in CROSSTABS. If the
differences in the distributions of these variables are very small (fractions of a percent), then we
are good to go. From the previous assignment, we can confirm that our data set is split into two
randomized and equal halves.
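As a rough illustration of the split-and-check idea above (outside of SPSS), here is a minimal Python sketch. The toy gender data and the names assign_halves and percent_distribution are made up for this example; they are not part of the assignment.

```python
import random
from collections import Counter

def assign_halves(n_cases, seed=0):
    """Return a list of 1s and 2s that puts a random half of the cases in each group."""
    labels = [1] * (n_cases // 2) + [2] * (n_cases - n_cases // 2)
    random.Random(seed).shuffle(labels)
    return labels

def percent_distribution(values):
    """Percentage of cases in each category, like one column of a crosstab."""
    counts = Counter(values)
    return {category: 100.0 * count / len(values) for category, count in counts.items()}

# Hypothetical toy data standing in for one variable of the Customer data set.
rng = random.Random(42)
genders = [rng.choice(["F", "M"]) for _ in range(10000)]
halves = assign_halves(len(genders), seed=1)

# Compare the distribution of the variable in each half.
for half in (1, 2):
    in_half = [g for g, h in zip(genders, halves) if h == half]
    print("Half", half, percent_distribution(in_half))
```

If the split is random, the two printed distributions should differ only slightly, which is the same check we made with CROSSTABS.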
From here, we must calculate some of the interactions between variables that we found in
another previous assignment - the CHAID segmentation. CHAID stands for Chi-square
Automatic Interaction Detection. It produces a tree that shows which variables contain the largest
segments of customers, and continues on by further dividing each segment. For example - the
tree starts by segmenting customers from non-customers. Then, it segments the customers
further by say, the market value of their home. Then it continues, by separating this market
value segment into each gender. By doing this, we can examine the percentage of customers in
each segment, and compare their concentration to the rest of the data set for targeting
purposes.
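To give a feel for the "chi-square" part of CHAID, here is a small Python sketch of the Pearson chi-square statistic that such a tree relies on when it compares candidate splits. The two tables are invented counts, and a real CHAID run does more than this (for example, it merges categories and adjusts the significance levels), so treat it as an illustration only.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if the split told us nothing.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical counts: each row is a segment, columns are [customers, non-customers].
by_home_value = [[30, 70], [10, 90]]
by_gender = [[20, 80], [20, 80]]

print("home value:", chi_square(by_home_value))
print("gender:", chi_square(by_gender))
```

In this toy case the home value split separates customers from non-customers far better than the gender split, so a tree would branch on home value first.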
By identifying these interactions, we can make new variables to add to our model. This is what
we will do here. We do this because, even though CHAID is a great tool for segmenting the data
set, we are more interested in seeing the total, combined interactions. For this purpose, a
regression is the better option. To turn these segmentation interactions into
variables, we simply multiply together the variables that define some of the segments that
demonstrated high concentrations of customers. Here is a screenshot of a few that I used:

These interactions are computed, and added to the list of our data set’s independent variables.
From here, we can add them and all of our other independent variables to a step-wise
regression. First, we must make sure to select which half of the data-set we want to test. We will
test Half 1 - by selecting it in SPSS.
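The interaction variables described above are just products of existing variables. A minimal Python sketch, assuming the segments are coded as 0/1 flags (the variable names here are hypothetical, not the ones from the assignment):

```python
def add_interaction(records, new_name, first, second):
    """Add a new variable equal to the product of two existing variables."""
    for record in records:
        record[new_name] = record[first] * record[second]
    return records

# Hypothetical 0/1 flags for two segments found in the CHAID tree.
records = [
    {"high_home_value": 1, "female": 1},
    {"high_home_value": 1, "female": 0},
    {"high_home_value": 0, "female": 1},
]
add_interaction(records, "high_value_x_female", "high_home_value", "female")

for record in records:
    print(record)
```

With 0/1 flags, the product is 1 only for cases that fall into both segments at once, which is exactly the combined segment the tree pointed to.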
A step-wise regression is valuable for differentiating significant predictor (independent) variables
from insignificant ones. All you do is throw the kitchen sink (all of your variables) in, and SPSS
will enter them one at a time, at each step adding the variable that contributes most significantly
to explaining our outcome variable, so the model lists them in the order they were entered. For
purposes of orderliness and ease of use, we always want our models to be parsimonious - meaning
they have as few variables as possible while still making good predictions.
This being the case, I chose the first eleven variables the step-wise regression returned. You will
know when to stop adding variables by how much the total R-square value increases. The
R-square is a measure of how much of the variance in the outcome variable the independent
variables in the model explain, and the adjusted R-square corrects that measure for the number of
variables in the model. If you have ten variables that have an adjusted R-square of .159, or
fifteen variables with a .164 - it’s best to just use the ten variables, because each additional
variable isn’t explaining much variance at this point.
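The stopping rule above can be written down in a few lines of Python. The adjusted R-square formula is the standard one; the cutoff of .005 and the list of R-square values are assumptions made up for this sketch, not numbers from the assignment.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Standard adjustment of R-square for the number of predictors in the model."""
    return 1.0 - (1.0 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

def steps_worth_keeping(r_square_by_step, min_gain=0.005):
    """Count steps until the first one that adds less than min_gain to R-square."""
    kept = 0
    previous = 0.0
    for r_square in r_square_by_step:
        if r_square - previous < min_gain:
            break
        kept += 1
        previous = r_square
    return kept

# Hypothetical R-square after each step of a step-wise run.
r_square_by_step = [0.08, 0.12, 0.15, 0.152]

print(steps_worth_keeping(r_square_by_step))
print(adjusted_r_square(0.5, n_cases=101, n_predictors=10))
```

The right cutoff is a judgment call. The point is only that once the gain per added variable gets tiny, the extra variables cost more in complexity than they return in explained variance.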
We’re cooking with gas now!
We have eleven solid, significant variables that we can now throw into a regular regression,
signified by switching the “Stepwise” option to the “Enter” option in SPSS. It is important to note
that for this assignment, we need to check the option to “replace missing values with the mean”
in SPSS.
This means that if any of the single members of our data population have missing values, we
will replace those missing values with the mean of that variable, instead of tossing the
member altogether. This way, the variable’s mean does not change, and we don’t
have to waste data. Luckily for us, the eleven variables that made it into our model had no
missing data.
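For anyone curious what "replace missing values with the mean" amounts to, here is a minimal Python sketch. It assumes missing values are stored as None, and the numbers are invented.

```python
from statistics import mean

def replace_missing_with_mean(values):
    """Fill each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical variable with one missing value.
home_value = [10, None, 20, 30]
filled = replace_missing_with_mean(home_value)
print(filled)
```

The case with the missing value stays in the data set, and the mean of the variable is the same before and after the fill.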
Now, we will input the variables into the “enter” regression, and save the output as a variable.
We will call the variable PREDICT, because we will use it later to predict, based on our
observations of customers and non-customers in this data-set, the likelihood of finding
additional customers among separate data-sets.
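SPSS does the fitting for us, but to show what a saved variable like PREDICT contains, here is a bare-bones least-squares sketch in Python. It assumes the outcome is coded 1 for customers and 0 for non-customers, and the tiny data set is made up. The helper names solve, fit_ols and predict are my own.

```python
def solve(matrix, vector):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(vector)
    aug = [list(row) + [value] for row, value in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]

def fit_ols(rows, outcome):
    """Return [intercept, b1, b2, ...] from the least-squares normal equations."""
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, outcome)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefficients, rows):
    """Predicted value for each case: intercept plus the weighted sum of its variables."""
    return [coefficients[0] + sum(b * x for b, x in zip(coefficients[1:], row))
            for row in rows]

# Hypothetical cases: two 0/1 predictors and a 0/1 customer flag.
rows = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 1), (0, 0)]
customer = [1, 1, 0, 0, 0, 1]

coefficients = fit_ols(rows, customer)
PREDICT = predict(coefficients, rows)
print(coefficients)
print(PREDICT)
```

Each value in PREDICT is the model's estimate for one case, and higher values mean the case looks more like the observed customers.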

We will now divide this output into deciles, since we are concerned primarily with our model’s
capability of finding prospects based on how much they resemble the observed customers of
this data-set. SPSS does this fairly easily by going to Transform and selecting Rank Cases. We
then save that output as DECILES, and using CROSSTABS, we can compare our observed
customers with our ranked predictions. Here is a screenshot of this:
This is great; our prediction works. We would hope that the top decile (1) has the highest
concentration of customers and that the bottom decile (10) has the lowest. As we can see, it
does. In the first decile, we have an 8.9% higher concentration of
customers than in the rest of the data-set.
We can take those odds to Vegas!
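Here is a minimal Python sketch of the decile step, following this paper's convention that decile 1 holds the highest predicted values. The scores and customer flags are invented to keep the output easy to read, so the result is much cleaner than real data would be.

```python
def assign_deciles(scores):
    """Rank cases by score so decile 1 is the highest tenth and decile 10 the lowest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    deciles = [0] * len(scores)
    for rank, index in enumerate(order):
        deciles[index] = rank * 10 // len(scores) + 1
    return deciles

def customer_rate_by_decile(deciles, customer):
    """Percentage of customers within each decile, like one row of the crosstab."""
    rates = {}
    for decile in range(1, 11):
        flags = [c for d, c in zip(deciles, customer) if d == decile]
        if flags:  # skip deciles that received no cases
            rates[decile] = 100.0 * sum(flags) / len(flags)
    return rates

# Hypothetical predicted values and customer flags for 100 cases.
scores = [i / 100 for i in range(100)]
customer = [1 if i >= 80 else 0 for i in range(100)]

DECILES = assign_deciles(scores)
print(customer_rate_by_decile(DECILES, customer))
```

A model that works should show the customer percentage falling as we move from decile 1 down to decile 10.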
Now, we have to test this model on the other half, Half 2. To do this, we have to calculate a
score for our output that we can apply to the second half, but before we do this - we must
confirm the score we calculate correctly interprets the regression output that we have.

This is pretty easy: we go into Transform > Compute and enter an equation based on the
variables in our model:
Our final score then looks like this, and we save it as another variable in the set, SCORE:
We then compare this variable SCORE to PREDICT, and luckily for us - they are almost
identical. This means our score calculation is a correct interpretation of our regression model,
and can now be applied to an independent sample (data-set).
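The score is nothing more than the regression equation typed out by hand: the constant plus each coefficient times its variable. As a sketch in Python, with made-up coefficients, variable names and saved predictions (these are not the numbers from the assignment):

```python
def score(case, intercept, weights):
    """Regression equation: constant plus each coefficient times its variable."""
    return intercept + sum(weights[name] * case[name] for name in weights)

# Hypothetical constant and coefficients copied from a regression output.
intercept = 0.05
weights = {"high_home_value": 0.10, "female": 0.02, "high_value_x_female": 0.04}

cases = [
    {"high_home_value": 1, "female": 1, "high_value_x_female": 1},
    {"high_home_value": 1, "female": 0, "high_value_x_female": 0},
    {"high_home_value": 0, "female": 1, "high_value_x_female": 0},
    {"high_home_value": 0, "female": 0, "high_value_x_female": 0},
]
# Hypothetical values the regression saved for the same four cases.
predict_saved = [0.21, 0.15, 0.07, 0.05]

SCORE = [score(case, intercept, weights) for case in cases]
largest_gap = max(abs(s - p) for s, p in zip(SCORE, predict_saved))
print(largest_gap)
```

If the largest gap is tiny, the hand-typed equation matches the model. Small differences would be expected if the coefficients were rounded when they were typed in.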
To get to our simulated, independent data set - we now select Half 2. With Half 2 active, we run
the code we just made, and then transform the output again into deciles. We will save these
deciles as DECILES2, and run the same CROSSTABS function as we did before.
If we have done our job correctly, this Crosstabs will look almost identical, and hopefully a little
better than the first one. Drumroll!!!!

Hallelujah! This crosstabs has a slightly higher percentage of customers in the first decile, but is
practically the same. Check out the comparison on the next page. Our predictive model works!

This is great news. We built a regression model and ran it on one data-set. If the model worked,
we would expect similar results by running the same model on a different data-set. We got those
similar results.
With these techniques and a handy tool like SPSS, we can take a data set, build a regression
model to find characteristics shared by customers and non-customers - then validate this model
on an independent data set. By doing this, we can significantly increase our chances of finding
new customers anywhere, which is obviously an incredibly valuable skill to have.
But remember, anyone can input numbers into SPSS. The real difference between a good and a
great market researcher is being able to interpret those numbers by asking good questions and
employing impeccable language!
