Estimating Models Using Dummy Variables
You have had plenty of opportunity to interpret coefficients for metric variables in regression models. Using and interpreting categorical variables takes just a little bit of extra practice. In this Discussion, you will have the opportunity to practice how to recode categorical variables so they can be used in a regression model and how to properly interpret the coefficients. Additionally, you will gain some practice in running diagnostics and identifying any potential problems with the model.
To prepare for this Discussion:
Review Warner’s Chapter 12 and Chapter 2 of the Wagner course text and the media program found in this week’s Learning Resources and consider the use of dummy variables.
Create a research question using the General Social Survey dataset that can be answered by multiple regression. Using the SPSS software, choose a categorical variable to dummy code as one of your predictor variables.
Estimate a multiple regression model that answers your research question. Post your response to the following:
What is your research question?
Interpret the coefficients for the model, specifically commenting on the dummy variable.
Run diagnostics for the regression model. Does the model meet all of the assumptions? Be sure to comment on which assumptions were not met and the possible implications. Is there any possible remedy for one of the assumption violations?
Be sure to support your Main Post and Response Post with reference to the week’s Learning Resources and other scholarly evidence in APA Style.
Regression Diagnostics and Model Evaluation
Program Transcript
[MUSIC PLAYING]
MATT JONES: We've gone over estimating bivariate and multiple regression models, but one thing we haven't talked about up to this point is the set of assumptions behind multiple regression models. It's very important to meet these assumptions in order to interpret our models properly. These assumptions include linearity, independence of errors, homoscedasticity, absence of multicollinearity, absence of undue influence, and normal distribution of errors. Let's go back to SPSS to see how we can test these assumptions and evaluate our models.
Let's go ahead and estimate a multiple regression model using respondent's socioeconomic status index as the dependent variable, respondent's highest education as an independent variable, and occupational prestige score as an independent variable. But this time, let's request some additional information to perform some diagnostics on our model.
Go to Analyze, Regression, and Linear, since we are still using an ordinary least squares method. We'll scroll down and enter my dependent variable first, respondent socioeconomic index. Then I enter my independent variables, occupational prestige and highest year of school completed. I want to go over to Statistics and request some additional information. I will request collinearity diagnostics and the Durbin-Watson statistic for the residuals.
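The Durbin-Watson statistic requested here is used to check the independence-of-errors assumption. It ranges from 0 to 4, and values near 2 suggest little first-order autocorrelation. As a minimal sketch in Python (not SPSS), assuming the residuals are already available as a list in case order, the statistic could be computed like this. The function name and the example residuals are hypothetical and are not taken from the model in the video.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: the sum of squared successive differences
    between residuals, divided by the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals in case order (illustrative values only).
example_residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]
print(round(durbin_watson(example_residuals), 3))
```

Residuals that repeat the same sign and size push the statistic toward 0, while residuals that alternate in sign push it toward 4.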
Click Continue. We'll also click on Plots and request a plot of my predicted values against my residuals. I enter those and click Continue. Lastly, I want to click on Save and request Cook's, which is also called Cook's distance, and my standardized residuals. Click Continue. And once I click OK, the model will be [...]
As a general rule, values close to 10, and definitely above 10, indicate serious multicollinearity in the model. That means the independent variables are highly correlated with each other. We see here that the value of 1.4, for both of our predictor variables, is well below that 10.0 general rule. Therefore, we can assume that we've met the assumption.
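The surviving transcript does not name the statistic behind the rule of 10. The sketch below assumes it is the variance inflation factor (VIF), which is the statistic this rule of thumb is conventionally applied to. With exactly two predictors, each predictor's VIF equals 1 / (1 - r^2), where r is the correlation between the two predictors, which would explain why both predictors share one value. The function names and the small data set are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


def correlation_implied_by_vif(vif):
    """Absolute predictor correlation implied by a two-predictor VIF."""
    return math.sqrt(1.0 - 1.0 / vif)


# Hypothetical predictor values (illustrative only, not GSS data).
education = [1, 2, 3, 4]
prestige = [2, 1, 4, 3]
print(vif_two_predictors(education, prestige))

# Under the VIF assumption, the 1.4 reported in the video would correspond
# to a correlation of roughly 0.53 between the two predictors.
print(round(correlation_implied_by_vif(1.4), 2))
```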
We requested Cook's distance, which tells us something about undue influence, that is, specific outliers on one of the variables that might be exerting undue influence on the model. They might have a significant impact. We can go to our Cook's distance and look at the descriptives in our residual statistics.
Again, as a general rule, Cook's distance values of 1.0 or greater are considered problematic, and further diagnostics should be performed to evaluate for possible undue influence on the model. We see here that our Cook's distance values range from a minimum of 0.0 to a maximum of 0.025, well below the general rule of 1.0. We can assume that we have no undue influence in this model.
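To make the 1.0 rule concrete, here is a minimal Python sketch of Cook's distance for a regression with a single predictor. This is a simplification of the two-predictor model in the video. For each case it uses the standard formula D = (e^2 / (p * MSE)) * h / (1 - h)^2, where e is the residual, h is the leverage, and p is the number of estimated parameters. The function name and data are hypothetical, and the last case is deliberately placed far from the others.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a one-predictor OLS regression."""
    n = len(x)
    p = 2  # estimated parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1.0 / n + (xi - mean_x) ** 2 / sxx
        distances.append(
            (e ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2
        )
    return distances


# Hypothetical data: the last case sits far from the rest on both x and y.
x = [1, 2, 3, 4, 5, 10]
y = [1, 2, 3, 4, 5, 0]
flagged = [i for i, d in enumerate(cooks_distances(x, y)) if d >= 1.0]
print(flagged)  # zero-based positions of cases at or above the 1.0 rule
```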
After examining the Cook's distance, we can examine a histogram of the distribution of our errors. An assumption of multiple regression is that the errors are normally distributed. As you can see from our histogram, our distribution is fairly normal. Therefore, we can conclude that we have met this assumption. Or, at the least, we do not have a significant deviation from normality. I should note that many modern statisticians see this assumption as being of little importance to estimating regression models, as it has little impact on the model.
Next, let's look at the scatterplot, which provides us with information about homoscedasticity, or whether our residuals at each level of the predictor are equal in variance. As we can see here, there is no discernible pattern in the spread of the scatter. If our model suffered from heteroscedasticity, we would see a grouping of scatter at one end that funnels out into a discernible pattern, often looking like a trumpet.
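The visual check described above can be roughly approximated with numbers. The Python sketch below is a crude stand-in for the scatterplot, not something SPSS produces and not a formal test. It sorts cases by predicted value, splits them into a lower and an upper half, and compares the residual variance in each half. A ratio near 1 is consistent with an even spread, while a ratio far from 1 suggests the trumpet shape. The function name and data are hypothetical.

```python
import statistics


def spread_ratio(predicted, residuals):
    """Residual variance in the upper half of the predicted values
    divided by residual variance in the lower half."""
    pairs = sorted(zip(predicted, residuals))
    half = len(pairs) // 2  # with an odd count, the middle case is dropped
    lower = [e for _, e in pairs[:half]]
    upper = [e for _, e in pairs[-half:]]
    return statistics.pvariance(upper) / statistics.pvariance(lower)


# Hypothetical residuals that fan out as the predicted value grows.
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
funnel = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]
even = [1, -1, 1, -1, 1, -1, 1, -1]

print(spread_ratio(predicted, funnel))  # far above 1: spread is uneven
print(spread_ratio(predicted, even))    # close to 1: spread is even
```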
If we double-click our scatterplot, we obtain the Chart Editor. And I'm just going to [...]
Dummy Variables
Program Transcript
DR. MATT JONES: Hi everybody, this is Dr. Matt Jones from the Center for Research Quality, here to talk to you today about constructing dummy variables in SPSS. The purpose behind our conversation today is to show you how to construct these dummy variables to use as independent variables when you are fitting a multiple regression model.
I have in front of us the Afrobarometer data set. I've greatly simplified it for the purpose of this demonstration. You'll see there are only three variables in it: country in alphabetical order, country by region, and trust in government index. Now I might want to use a variable, country by region, that might be relevant to my research question or might be an important control variable that I need to use in my multiple regression analysis. And it's very tempting just to throw it in as an independent variable as it is here.
SPSS will allow me to do that. It will produce some output for that variable, a coefficient and so forth, and associated p-values. But the statistics generated won't necessarily make any sense unless I create a dummy variable or set of dummy variables from this original variable. So if I go and click on Values here, you'll see that there are four attributes, or four groups, to this variable country by region: West Africa, East Africa, Southern Africa, and North Africa.
And the rule for creating dummy variables is the number of groups minus 1. That is, there are four groups here, four attributes to this variable. So I need to take 4 minus 1, which equals 3. I need to create three dummy variables. One group is always left out, if you will, to serve as the reference category. And just again for the sake of simplicity, I'm going to leave number four, North Africa, as our reference category today.
What you pick as your reference category might depend on your research question, some theory, or what it is you're trying to find out. Again, it's very context specific. But, again, for today, I'm just going to somewhat arbitrarily choose North Africa as our reference category. Before you create your dummy variables, you do want to note the original coding on the original variable, here one, two, three, and four, and what those codes correspond to.
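The groups-minus-1 rule and the choice of a reference category can be sketched in Python rather than SPSS. This is only an illustration, assuming the codes 1 through 4 follow the order in which the groups are listed, with 4 for North Africa as stated. The names REGION_LABELS, REFERENCE_CODE, and dummy_code are hypothetical.

```python
# Original coding of "country by region" (assumed to follow the listed order).
REGION_LABELS = {
    1: "West Africa",
    2: "East Africa",
    3: "Southern Africa",
    4: "North Africa",
}
REFERENCE_CODE = 4  # North Africa is left out as the reference category


def dummy_code(region_code):
    """Return the k - 1 dummy variables (each 0 or 1) for one case."""
    return {
        label: int(region_code == code)
        for code, label in REGION_LABELS.items()
        if code != REFERENCE_CODE
    }


print(dummy_code(1))  # a West Africa case: only the West Africa dummy is 1
print(dummy_code(4))  # a North Africa case: every dummy is 0
```

Four groups yield three dummy variables, and a case in the reference category is identified by zeros on all three.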
So let's go ahead and move out of here and create these variables. Transform, Recode into Different Variables: this is how we're going to start our process of [...]
And for our label, we'll do the same. So I have to give it old and new values. So the old value for West Africa was 1, and the new value is going to be 1. Dummy variables have to take on a value of either 1 or 0: 1 means the case has the attribute, and 0 means it doesn't. And hopefully that becomes just a little bit clearer as we walk through this process.
Now since I'm creating this dummy variable for West Africa, I've already told SPSS, from that original variable, country by region, to take all the West Africa cases and essentially flip that switch, turning them on to create this new West Africa dummy variable. All others turn off. Those are not West Africa.
So there are a couple of different ways you can do that. I'm going to show you what I would call, quote unquote, the long way of doing this. Remember, we had three other groups in this variable. We had the old value of 2, which corresponded to East Africa; that's now going to be a 0. We had 3, which was Southern Africa. Again, for our West Africa variable, it's going to be 0. Don't forget to hit Add. And then 4, North Africa, which is our reference category, 0. Add.
Now I could go ahead and use the range. I could have just done 2 through 4 equals 0. That would have worked. Or I could've also done all other values and then put in the new value of 0. So that would have told SPSS, OK, take 1 equals 1, all others 0. The only thing to be just a little cautious about there is if you have user-defined or system-defined missing values. They could possibly be thrown in there. Again, it depends upon the specific data set and how things are coded in there.
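The caution about missing values can be illustrated with a short Python sketch, again outside SPSS. It compares the explicit "long way" mapping with an "all other values become 0" shortcut. The missing-value code 9 is hypothetical and is used only to show how a missing case could be silently turned into a 0. The function names are also hypothetical.

```python
# The "long way": every valid old value is mapped explicitly.
WEST_AFRICA_MAP = {1: 1, 2: 0, 3: 0, 4: 0}


def recode_explicit(code):
    """Unlisted codes (including missing ones) stay missing (None)."""
    return WEST_AFRICA_MAP.get(code)


def recode_else_zero(code):
    """Shortcut: 1 stays 1 and everything else becomes 0."""
    return 1 if code == 1 else 0


# 9 stands in for a hypothetical user-defined missing code; None for a blank.
codes = [1, 2, 3, 4, 9, None]

print([recode_explicit(c) for c in codes])   # missing cases stay missing
print([recode_else_zero(c) for c in codes])  # missing cases become 0
```

With the shortcut, the two missing cases would be counted as "not West Africa" rather than being excluded from the analysis.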
So that's why I'm just showing you what I call, quote unquote, the long way of doing it today. Go ahead and click Continue. Be sure to hit this Change button. Once you hit that, you'll see that OK is activated. We can go ahead and create that variable. We'll see we get some SPSS syntax output here. And there it is. It says our variable has been created. OK, so here it is, West Africa.
Now we need to create two other dummy variables so that we have our three dummy variables here. So let's go ahead and see if we can quickly do this. So, Recode into Different Variables. One thing I always do is hit Reset, because everything you did before is still going to be in there, so I just do this so I don't confuse myself. Again, doing the same thing, country by region. Except this one I'm calling East Africa because I'm creating a separate dummy variable for East [...]
Again, 4 is our reference category, and our reference category is going to be 0 in each of them. Hit Continue. Be sure, again, to hit Change. If you don't hit Change, you'll see that OK will not light up, if you will, for you. You have to go ahead, and there we go. Hit OK. And there's our output; it's been created. And one more. Recode into Different Variables, Reset, again, country by region, our final variable, which is Southern Africa.
Old and New Values: in the old value, the original country by region, Southern Africa was coded as a 3, and now it's going to be coded as 1 because we're focusing on that Southern Africa dummy variable. Add. And all others 0. And make sure we get our reference, North Africa, in here, 0. Change. OK.
So now we can see we have West Africa, East Africa, and Southern Africa set up as dummy variables. And if we click down here and go to our Data View, you'll see our original variable, country by region. Remember, 1 in the original variable corresponded to West Africa. And if we go over here and look at our West Africa variable, you see that cases where the original variable was 1, coded West Africa, are now also 1 for the West Africa dummy variable but 0 for East and Southern Africa.
And so if we go ahead and scroll down to East Africa, here we go. For cases where our original variable, country by region, is 2, remember that equaled East Africa, and still does in the original variable. But now if we go over to West Africa, it's 0; it's not West Africa. For East Africa, it's 1; it is East Africa. So for this dummy variable those things are matching up, which is a good thing. And 0 for Southern Africa. Again, remember North Africa is serving as our reference category.
So now that we've created dummy variables, we can go ahead and use them in a linear regression analysis. So let's just go ahead and run a very quick linear regression. I've gone ahead and already entered those variables in here. We don't have many variables to work with, so we're just going to use trust in government as our dependent variable and, again just for the sake of demonstration, use only these categorical dummy variables as independent variables. Click OK.
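When the dummy variables are the only predictors, as in this model, the least-squares coefficients have a simple meaning. The intercept is the mean of the dependent variable in the reference category, and each dummy coefficient is that group's mean minus the reference mean. The Python sketch below shows this with group means rather than by fitting the regression. The trust scores are invented for illustration and are not Afrobarometer values, and the function name is hypothetical.

```python
from statistics import mean


def dummy_only_coefficients(groups, scores, reference):
    """Intercept and dummy coefficients for a regression whose only
    predictors are the k - 1 dummies for one categorical variable."""
    by_group = {}
    for group, score in zip(groups, scores):
        by_group.setdefault(group, []).append(score)
    intercept = mean(by_group[reference])  # mean of the reference category
    coefficients = {
        group: mean(values) - intercept    # difference from the reference
        for group, values in by_group.items()
        if group != reference
    }
    return intercept, coefficients


# Invented trust-in-government scores, two cases per region.
groups = ["West Africa", "West Africa", "East Africa", "East Africa",
          "Southern Africa", "Southern Africa", "North Africa", "North Africa"]
scores = [6, 8, 5, 5, 9, 11, 4, 6]

intercept, coefficients = dummy_only_coefficients(groups, scores, "North Africa")
print(intercept)     # mean trust score in North Africa
print(coefficients)  # each region's mean minus the North Africa mean
```

Read this way, a positive coefficient means that region's average is higher than the reference region's average, and a coefficient of 0 means the two averages are equal.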
So by this point, you're probably very familiar with the output around the model summary and the ANOVA, so let's go ahead and look at the coefficients. You'll see here now that we have each of these dummy variables represented. And so we
can go ahead and interpret these coefficients, these un-