This document analyzes the relationship between family size and number of credit cards using correlation and linear regression analysis in SPSS and Excel. It finds a positive correlation between the two variables: the number of credit cards increases with family size. The correlation coefficient is 0.866 and the regression equation is: predicted credit cards = 2.871 + 0.971 x family size. The relationship explains 75% of the variance in the number of credit cards.
Correlation and Regression Analysis of Family Size and Credit Cards
1. Slide 1 (05/15/15)
• Using a combination of tables and plots from SPSS plus
spreadsheets from Excel, we will show the linkage between
correlation and linear regression.
• Correlation and regression provide us with different, but
complementary, information on the relationship between
two quantitative variables.
2. Slide 2
The goal of this analysis is to study
the relationship between family
size and number of credit cards.
Finding the relationship will help us
predict the number of credit cards
a family typically has relative to the
number of family members. If a
family had fewer than expected,
they would be a good candidate for
us to extend another credit card
offer.
CreditCardData.sav has five
variables for 8 cases.
The data for the 8 cases are
shown in the Data View to the
left. The names and labels for
each of the variables are shown
below in the Variable View.
3. Slide 3
Creating a histogram of the
dependent variable, ncards, shows a
distribution that is about as normal as
we could expect for only 8 cases. I
have superimposed the red normal
curve and blue mean line on the
histogram.
For any quantitative variable,
our best estimate of the values
for cases in the distribution is
the mean, because it minimizes
the errors or differences
between the estimated value
and the actual score
represented by each of the bars
in the histogram.
4. Slide 4
To demonstrate that the mean is the best
value to estimate, I created a worksheet in
Excel that compares the error associated
with three different estimates of values for
each case: the mean of 7, an estimate
lower than the mean: 6, and an estimate
higher than the mean: 8.
Error is calculated as the sum of the squared deviations
from the value used as the estimate. Columns C, F, and I
contain the deviations from each of the estimates (7, 6,
and 8). Columns D, G, and J contain the squared
deviations, and the summed total at the base of the
columns.
Using the mean of 7 as the estimate, there are 22 units of
error. Using either 6 or 8 results in 30 units of error.
The measure of error is called the Total Sum of Squares.
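The spreadsheet comparison above can be sketched in a few lines of Python. The eight card counts are an assumption (the eight-family example dataset commonly used with this exercise), chosen because they reproduce the mean of 7 and the error totals of 22 and 30 quoted above.

```python
# Worked check of the Slide 4 spreadsheet. The card counts below are an
# assumed reconstruction of the eight cases; they match the mean of 7
# and both error totals described in the text.
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

def sum_squared_error(values, estimate):
    """Sum of squared deviations from a single constant estimate."""
    return sum((v - estimate) ** 2 for v in values)

mean = sum(ncards) / len(ncards)    # 7.0
errors = {est: sum_squared_error(ncards, est) for est in (6, 7, 8)}
print(mean)      # 7.0
print(errors)    # {6: 30, 7: 22, 8: 30} -- the mean gives the least error
```

Using the mean (7) yields 22 units of error, the Total Sum of Squares; any other constant estimate, such as 6 or 8, yields more.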
5. Slide 5
The graph for the relationship
between two quantitative
variables is the scatterplot, with
the independent variable Family
Size on the horizontal x-axis, and
the dependent variable Number
of Credit Cards on the vertical y-
axis.
I have superimposed the
blue dotted mean line for
Number of Credit Cards on
the scatterplot. We see
that the scores for two
cases actually fall on the
mean line, while the other
six are at varying distances
from the mean line.
Each dot represents the
combination of scores
for one case. For
example, this dot
represents a family of 5
that had 8 credit cards.
6. Slide 6
The purple lines are the deviations
– the differences between
individual scores and the mean of
the dependent variable.
If we square the deviations
and sum the squares, we have
the Total Sum of Squares.
The differences are often phrased as
distances, i.e. the vertical distance
between the mean line and the score for
this case on the dependent variable is 3.
7. Slide 7
I have added the green vertical
dotted line at the mean family
size, 4.25.
The regression line will pass
through the intersection of
the means of both variables,
and will minimize the total
sum of the squared differences
between the individual scores
and the regression line.
8. Slide 8
One way to think about linear
regression is that we are
rotating a line through the
intersection of the means of the
two variables.
Each time we rotate the
line, we would compute
the sum of squared
distances from the line.
We stop when we have
found the line that has
the smallest sum of
squared distances.
There is a direct method for finding
the regression line that does not
require this trial and error strategy.
9. Slide 9
If there is no relationship, the blue
regression line will be on top of (or
very close to) the dotted blue
mean line for the dependent
variable.
No relationship means that
we cannot reduce the
error or total sum of
squares of the dependent
variable by using the
relationship to the
independent variable.
10. Slide 10
The points along the regression line
represent the estimated values for all
possible values of the independent
variable.
For example, if we wanted to estimate
the number of cards for a family of 4, we
would draw a vertical line from the 4 on
the horizontal axis up to the regression
line, and from the regression line left to
the vertical axis. The location on the
vertical axis is the estimated number of
cards that a family of 4 would have, i.e.
about 6.8 cards.
11. Slide 11
The differences between the estimated
value and the actual value for the cases
are deviations that are called residuals
(the light blue lines). They represent
errors in predicting the values of the
dependent variable based on the value
of the independent variable.
We had two cases with a family size of 4.
Our estimated value was overstated for
one of the cases, and understated for
the other case.
12. Slide 12
The formula for the regression line can be
extracted from the SPSS output.
For this example, the regression equation
is:
predicted ncards = 2.871 + 0.971 x famsize
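The equation can be written directly as a small function. Plugging in a family size of 4 reproduces the "about 6.8 cards" estimate read off the regression line on Slide 10.

```python
# The regression equation as read off the SPSS output on Slide 12.
def predict_ncards(famsize):
    """Predicted number of credit cards from the fitted regression line."""
    return 2.871 + 0.971 * famsize

print(round(predict_ncards(4), 3))   # 6.755, i.e. about 6.8 cards
```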
13. Slide 13
We can plug the regression
equation into Excel and estimate
the number of cards for each
case.
To compute the residuals,
we subtract the estimated
value from the actual value
of ncards for each case.
If we square the residuals, and
sum the squares, we have the
amount of error associated with
using the regression line to
estimate each case, 5.485758.
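The Excel worksheet on this slide can be sketched as follows. The eight data values are an assumed reconstruction of the cases; with the rounded coefficients 2.871 and 0.971 they reproduce the 5.485758 figure exactly.

```python
# Residuals and their squared sum, as computed on Slide 13.
# The data values below are assumed reconstructions of the eight cases.
famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

predicted = [2.871 + 0.971 * x for x in famsize]
residuals = [actual - est for actual, est in zip(ncards, predicted)]
sse = sum(r ** 2 for r in residuals)   # error left after using the line
print(round(sse, 6))   # 5.485758
```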
14. Slide 14
If we plug the total sum of squares and the sum of
squared residuals into an Excel spreadsheet, we
can compute the reduction in the total sum of
squared errors associated with using the
information in the independent variable, as
represented by the regression equation.
When we compute the percentage of total error
reduced by the regression equation, we end
up with the value of R², the percentage of
variance explained by the regression
relationship.
Our calculation for R² agrees with
the value of R Square in the SPSS
output.
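The Slide 14 spreadsheet reduces to one line of arithmetic, using the two error totals already computed in the deck (22 and 5.485758):

```python
# R-squared as the proportion of the total sum of squares removed by
# using the regression line. Both inputs are figures from the deck.
total_ss = 22.0
residual_ss = 5.485758

r_squared = (total_ss - residual_ss) / total_ss
print(round(r_squared, 6))   # 0.750647
```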
15. Slide 15
R² is often interpreted as the percentage of
variance explained.
We can convert our Sum of Squares column to
Variance by dividing by the number of cases in
the sample minus one (8 – 1).
If we compute the percentages using variances
instead of sums of squares, we end up with
exactly the same value for R², 0.750647.
R² is also interpreted as the proportional reduction in error ( a
PRE statistic), which we can also phrase as an increase in
accuracy.
We should remember that no matter whether we interpret R² as
explaining variance or reducing error, the statistic applies to the
total error in the distribution, not to the error in individual cases.
16. Slide 16
We can also think of regression and correlation as
based on the pattern of deviations for the two variables
across the cases in the distribution.
To present this, we will first compute the standard
scores (z-scores) for each variable. Expressed as standard
scores, each case's value is its deviation from 0, the
mean of any distribution of standard scores.
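Standardizing can be sketched as below. The data values are assumed reconstructions of the eight cases; the standard deviations they yield (1.581 for family size, 1.773 for cards) match those quoted later in the deck.

```python
# Standard scores for both variables, as described on Slide 16.
# Data values are assumed reconstructions of the eight cases.
from statistics import mean, stdev   # stdev divides by n - 1 (sample SD)

famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

def z_scores(values):
    """Convert raw scores to deviations from the mean in SD units."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(famsize), z_scores(ncards)
print(round(stdev(famsize), 3), round(stdev(ncards), 3))   # 1.581 1.773
print(abs(mean(zx)) < 1e-9, abs(mean(zy)) < 1e-9)          # True True
```

Both sets of z-scores have a mean of 0, which is why the dotted mean lines in the standardized scatterplot sit at zero.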
17. Slide 17
Plotting the z-scores for
both variables produces
the same pattern in the
scatterplot that we found
with the raw data.
As we would expect for standard
scores, the green dotted line for
the mean z-score for family size is
at zero, as is the dotted blue line
for the standard scores for number
of credit cards.
18. Slide 18
We add lines for the
deviation from the
means for both variables.
The green deviation lines
represent differences
from the mean z-score
for family size.
The blue deviation
lines represent
differences from the
mean z-score for
number of credit cards.
19. Slide 19
For some points, the
length of the green
deviation line is similar to
the length of the blue
deviation line.
The strength of the relationship will
depend on the agreement of the
deviations for each case, i.e. the extent
to which the green line deviation for a
case agrees with the blue line deviation.
For other points, the
length of the green
deviation line is shorter
than the length of the blue
deviation line.
20. Slide 20
Overall, the pattern of the deviations is similar. Green
deviations above the mean are paired with blue
deviations above the mean. Green deviations below the
mean are paired with blue deviations below the mean.
Though the length of the deviations for individual cases
varies, the overall pattern suggests a strong
relationship.
21. Slide 21
To compute the correlation
coefficient, we multiply the
z-scores, and sum across all
the cases.
To compute Pearson’s r, we divide
the sum of the z-score products
by the number of cases minus
one.
The value for Pearson’s r that
we computed agrees with the
value supplied by SPSS.
Finally, if we square the
value of Pearson’s r, we have
the same value as R Square
in the SPSS regression
output.
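Pearson's r can be computed exactly as Slide 21 describes: multiply the paired z-scores, sum the products, and divide by n - 1. The data values are assumed reconstructions of the eight cases; the result matches the 0.866 quoted in the deck, and its square matches R Square (within the deck's rounding).

```python
# Pearson's r from summed z-score products, per Slide 21.
# Data values are assumed reconstructions of the eight cases.
from statistics import mean, stdev

famsize = [2, 2, 4, 4, 5, 5, 6, 6]
ncards = [4, 6, 6, 7, 8, 7, 8, 10]

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(famsize), z_scores(ncards)
r = sum(a * b for a, b in zip(zx, zy)) / (len(famsize) - 1)
print(round(r, 3), round(r ** 2, 4))   # 0.866 0.7506
```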
22. Slide 22
If we return to the regression results for
the raw data instead of the standard
scores, we can show the link between
Pearson’s r and the slope in the regression
equation.
Recall that the slope of the regression line
represents the change in the dependent variable
associated with a one unit change in the
independent variable. Thus, for each additional
family member, we would predict 0.971 more
credit cards.
Think of the standard deviation as
a measure of the average
difference from the mean across all
of the cases for each variable.
The standard deviation for number
of cards is 1.773 and the standard
deviation for family size is 1.581.
23. Slide 23
If the relationship between the two variables were
perfect (one predicted the other without error), we
could compute the slope of the line using the average
amount of differences in each of the distributions –
the standard deviations.
On average, the number of cards would
go up 1.773 cards for a difference of
1.581 members in a family. We can
simplify this by dividing the standard
deviation for number of cards by the
standard deviation for family size:
1.773 ÷ 1.581 = 1.121
Thus, if the relationship were perfect, we
would increase our estimate of the
number of cards in a family by 1.121 for
every additional member of a family.
24. Slide 24
If the slope of the regression line is
1.121 when the relationship is
perfect, then we might expect the
slope to be 0.866 x 1.121 when the
relationship is less than perfect.
And in fact, that turns out to be true,
since:
0.866 x 1.121 = 0.971
The slope of the regression line is the
ratio of the standard deviations
multiplied by the correlation
coefficient.
If the relationship between the two
variables were perfect, Pearson’s r would
be 1.0 (or -1.0 if the relationship were
inverse).
However, we know that Pearson’s r is less
than that: it is actually 0.866.
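The closing identity from Slides 22-24 can be verified directly. The three rounded inputs are taken straight from the deck:

```python
# Slope = correlation x ratio of standard deviations (Slides 22-24).
# All three figures come from the deck (rounded to three decimals).
r = 0.866
sd_cards = 1.773
sd_famsize = 1.581

ratio = sd_cards / sd_famsize   # slope if the relationship were perfect
slope = r * ratio               # slope for the actual relationship
print(round(ratio, 3))   # 1.121
print(round(slope, 3))   # 0.971
```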