BUS308 Week 4 Lecture 1
Examining Relationships
Expected Outcomes
After reading this lecture, the student should be familiar with:
1. Issues around correlation
2. The basics of Correlation analysis
3. The basics of Linear Regression
4. The basics of Multiple Regression
Overview
Often in our detective shows, when the clues are not providing a
clear answer – such as
we are seeing with the apparent continuing contradiction
between the compa-ratio and salary-related results – we hear the line “maybe we need to look at this
from a different viewpoint.”
That is what we will be doing this week.
Our investigation changes focus a bit this week. We started the
class by finding ways to
describe and summarize data sets – finding measures of the
center and dispersion of the data with
means, medians, standard deviations, ranges, etc. As interesting
as these clues were, they did not
tell us all we needed to know to solve our question about equal
work for equal pay. In fact, the
evidence was somewhat contradictory depending upon what
measure we focused on. In Weeks 2
and 3, we changed our focus to asking questions about
differences and how important different
sample outcomes were. We found that not all differences were important, and that many relatively small result differences could safely be ignored for decision-making purposes –
they were due to simple sampling (or chance) errors. We found
that this idea of sampling error
could extend into work and individual performance outcomes
observed over time; and that over-
reacting to such differences did not make much sense.
Now, in our continuing efforts to detect and uncover what the
data is hiding from us, we
change focus again as we start to find out why something
happened, what caused the data to act
as it did, rather than merely what happened (describing the data
as we have been doing). This
week we move from examining differences to looking at
relationships; that is, if some measure
changes, does another measure change as well? And, if so, can
we use this information to make
predictions and/or understand what underlies this common
movement?
Our tools in doing this involve correlation, the measurement of
how closely two
variables move together; and regression, an equation showing
the impact of inputs on a final
output. A regression is similar to a recipe for a cake or other
food dish; take a bit of this and
some of that, put them together, and we get our result.
Correlation
We have seen correlations a lot, and probably have even used
them (formally or
informally). We know, for example, that all other things being equal, the more we eat, the more
we weigh. Kids, up to the early teens, grow taller the older they
get. If we consistently speed,
we will get more speeding tickets than those who obey the
speed limit. The more effort we put
into studying, the better grades we get. All of these are
examples of correlations.
Correlations exist in many forms. A somewhat specialized
correlation was the Chi
Square contingency test (for multi-row, multi-column tables) we looked at last week; if we find that the distributions differ, we say that the variables are related (correlated). This correlation runs from 0 (no correlation) through positive values (the larger the value, the stronger the relationship).
Probably the most commonly used correlation is the Pearson
Correlation Coefficient,
symbolized by r. It measures the strength of the association –
the extent to which measures change
together – between interval- or ratio-level variables. Excel’s fx function CORREL and the Data Analysis Correlation tool both produce Pearson correlations.
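For readers who want to compute this outside of Excel, here is a minimal sketch in Python; the study-hours and exam-score numbers are hypothetical, not course data, and np.corrcoef gives the same Pearson r that Excel's CORREL returns.

    import numpy as np

    # Hypothetical data: hours studied and exam scores for six students.
    hours = np.array([2, 4, 5, 7, 8, 10])
    scores = np.array([55, 62, 70, 74, 80, 88])

    # np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
    # entry is the Pearson r, the same value Excel's =CORREL() produces.
    r = np.corrcoef(hours, scores)[0, 1]
    print(f"Pearson r = {r:.2f}")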
Most correlations that we are familiar with show both the
direction (direct or inverse) as
well as the strength of the relationship, and run from -1.0 (a perfect inverse correlation) through 0 (no correlation) to +1.0 (a perfect direct correlation). A direct correlation is positive; that is, both
variables move in the same direction,
such as weight and height for kids. An inverse, or negative,
correlation has variables moving in
different directions. For example, the number of hours you sleep
and how tired you feel; the
more hours, the less tired while the fewer hours, the more tired.
The strength of a correlation is shown by the value (regardless
of the sign). For example,
a correlation of +.78 is just as strong as a correlation of -.78;
the only difference is the direction
of the change. If we graphed a +.78 correlation the data points
would run from the lower left to
the upper right and somewhat cluster around a line we could
draw through the middle of the data
points. A graph of a -.78 correlation would have the data points
starting in the upper left and run
down to the lower right. They would also cluster around a line.
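A quick way to see these patterns is to plot the pairs of values. The sketch below (assumed, not part of the lecture materials) generates two sets of made-up data with roughly +.78 and -.78 correlations and scatter-plots them side by side.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    direct = 0.78 * x + rng.normal(scale=0.6, size=100)    # roughly a +.78 pattern
    inverse = -0.78 * x + rng.normal(scale=0.6, size=100)  # roughly a -.78 pattern

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    ax1.scatter(x, direct)
    ax1.set_title("Direct (+) correlation: lower left to upper right")
    ax2.scatter(x, inverse)
    ax2.set_title("Inverse (-) correlation: upper left to lower right")
    plt.show()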
Correlations below an absolute value (when we ignore the plus
or minus sign) of around
.70 are generally not considered to be very strong. The reason for this is the coefficient of determination (CD). This equals the square of the correlation
and shows the amount of shared
variation between the two variables. Shared variation can be
roughly considered the reason that
both variables move as they do when one changes. The more
the shared variation, the more one
variable can be used to predict the other. If we square .70 we
get .49, or about 50% of the
variation being shared. Anything less is too weak a relationship to be of much help.
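A small worked sketch of that rule of thumb (the r values below are illustrative only): squaring the correlation gives the coefficient of determination.

    # CD = r squared = the proportion of variation the two variables share.
    for r in (0.70, 0.78, -0.78, 0.40):
        cd = r ** 2
        print(f"r = {r:+.2f}  ->  CD = {cd:.2f}  ({cd:.0%} shared variation)")
    # r = 0.70 gives CD = 0.49 (about the 50% threshold discussed above),
    # while r = 0.40 gives only 0.16 -- too little shared variation to help much.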
Students often feel that a correlation shows a “cause-and-effect”
relationship; that is,
changes in one thing “cause” changes in the other variable. In
some cases, this is true – height
and weight for pre-teens, weight and food consumption, etc. are
all examples of possible cause-
and- effect relationships; but we can argue that even with these
there are other variables that
might interfere with the outcomes. And, in research, we cannot
say that one thing causes or
explains another without having a strong correlation present.
However, just as our favorite detectives find what they think is
a cause for someone to
have committed the crime, only to find that the motive did not
actually cause that person to
commit the crime, a correlation does not prove cause-and-
effect. An example of this is one the author heard in a statistics class: a perfect +1.00 correlation found between the
barrels of rum imported into the New England region of the
United States between the years of
1790 and 1820 and the number of churches built each year. If
this correlation showed a cause-
and-effect relationship, what would it mean? Does rum drinking
(the assumed result of importing
rum) cause churches to be built? Does the building of churches
cause the population to drink
more rum?
As tempting as each of these explanations is, neither is
reasonable – there is no theory or
justification to assume either is true. This is a spurious
correlation – one caused by some other,
often unknown, factor. In this case, the culprit is population
growth. During these years – many
years before Carrie Nation’s crusade against Demon Rum – rum
was the common drink for
everyone. It was even served on the naval ships of most
nations. And, as the population grew,
so did the need for more rum. At the same time, churches in the
region could only hold so many
bodies (this was before mega-churches that held multiple
services each Sunday); so, as the
population got too large to fit into the existing churches, new
ones were needed.
At times, when a correlation makes no sense, we can find an underlying variable fairly easily with some thought. At other times, it is harder to figure out, and some experimentation is needed. The site http://www.tylervigen.com/spurious-correlations is an interesting website devoted to spurious correlations; take a look and see if you can explain them.
Regression
Linear. Even if the correlation is spurious, we can often use the
data in making
predictions until we understand what the correlation is really
showing us. This is what
regression is all about. Earlier correlations between age,
height, and even weight were
mentioned. In pediatricians’ offices, doctors will often have
charts showing typical weights and
heights for children of different ages. These are the results of
regressions, equations showing
relationships. For example (and these values are made up for
this example), a child’s height
might be his/her initial height at birth plus an average growth
of 3.5 inches per year. If the
average height of a newborn child is about 19 inches, then the
linear regression would be:
Height = 19 inches plus 3.5 inches * age in years, or in math
symbols:
Y = a + b*X, where Y stands for height, a is the intercept or initial value at age 0 (birth), b is the rate of growth per year, and X is the age in years.
In both cases, we would read and interpret it the same way: the
expected height of a child is 19
inches plus 3.5 inches times its age. For a 12-year-old, this
would be 19 + 3.5*12 = 19 + 42 = 61
inches or 5 feet 1 inch (assuming the made-up numbers are
accurate).
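As a minimal sketch, the same made-up height equation can be written as a small Python function; the intercept (19 inches) and growth rate (3.5 inches per year) are the lecture's invented values, not real pediatric norms.

    def predict_height(age_years, intercept=19.0, growth_per_year=3.5):
        """Expected height in inches: Y = a + b*X with a = 19 and b = 3.5."""
        return intercept + growth_per_year * age_years

    print(predict_height(12))   # 19 + 3.5 * 12 = 61.0 inches, matching the text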
Multiple. That was an example of a linear regression having
one output and a single
independent variable as an input. A multiple regression
equation is quite similar but has several
independent input variables. It could be considered to be
similar to a recipe for a cake:
Cake = cake mix + 2 * eggs + 1½ * cups of milk + ½ * teaspoon of vanilla + 2 * tablespoons of butter.
A regression equation, either linear or multiple, shows us how
“much” each factor is used in or
influences the outcome. The math format of the multiple
regression equation is quite similar to
that of the linear regression; it just includes more variables:
Y = a + b1*X1 + b2*X2 + b3*X3 + …; where a is the intercept
value when all the inputs
are 0, the various b’s are the coefficients that are multiplied by
each variable value, and
the x’s are the values of each input.
A note on how to read the math symbols in the equations. The
Y is considered the output or
result, and is often called the dependent variable as its value
depends on the other factors. The
different b’s (b1, b2, etc.) are coefficients and are read as b-sub-1, b-
sub-2, etc. The subscripts 1, 2, etc.
are used to indicate the different coefficient values that are
related to each of the input variables.
The X-sub-1, X-sub-2, etc., are the different variables used to
influence the output, and are called
independent variables. In the recipe example, Y would be the
quality of the cake, a would be the
cake mix (a constant as we use all of what is in the box), and the
other ingredients would relate to the
b*X terms. The 2*eggs would relate to b1*X1, where b1 would
equal 2 and X1 stands for eggs,
the second input relates to the milk, etc.
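To make the form concrete, here is a minimal sketch (with made-up numbers, not the course data) that estimates the intercept a and the coefficients b1 and b2 for Y = a + b1*X1 + b2*X2 using ordinary least squares in Python.

    import numpy as np

    # Hypothetical observations of one output (Y) and two inputs (X1, X2).
    X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    Y = np.array([5.1, 6.9, 11.2, 12.8, 17.1, 18.9])

    # The column of ones lets least squares estimate the intercept a;
    # the remaining columns give the coefficients b1 and b2.
    A = np.column_stack([np.ones_like(X1), X1, X2])
    (a, b1, b2), *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(f"Y = {a:.2f} + {b1:.2f}*X1 + {b2:.2f}*X2")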
Summary
This week we changed our focus from examining differences to
looking for relationships
– that is, do variables change in predictable ways? Correlation lets us
see both the strength and the
direction of change for two variables. Regression allows us to
see how some variables “drive” or
explain the change in another.
Pearson’s (for interval and ratio data variables) and Spearman’s
(for rank ordered or
ordinal data variables) are the two most commonly used
correlation coefficients. Each looks at
how a pair of variables moves in predictable patterns – either
both increasing together or one
increasing as the other decreases. The correlation ranges from -
1.00 (moving in opposite
directions) to +1.00 (moving in the same direction). These are
both examples of linear
correlation – how closely the variables move in a straight line
(if graphed). Curvilinear
correlations exist but are not covered in this class.
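If you ever need these coefficients outside of Excel, a minimal sketch using scipy.stats (the data values are hypothetical) computes both in one place; pearsonr is for interval/ratio data and spearmanr is for rank-ordered data.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    x = np.array([1, 2, 3, 4, 5, 6])
    y = np.array([2.0, 2.9, 4.2, 4.8, 6.1, 7.0])

    r_pearson, _ = pearsonr(x, y)      # Pearson r (interval/ratio level data)
    rho_spearman, _ = spearmanr(x, y)  # Spearman rho (ordinal/ranked data)
    print(f"Pearson r = {r_pearson:.2f}, Spearman rho = {rho_spearman:.2f}")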
Regression equations show the relationship between
independent (input) variables and a
dependent (output variables). Linear regression involves a pair
of variables as seen in the linear
correlations. Multiple regression uses several input
(independent) variables for a single output
(dependent) variable.
The basic form of the regression equation is the same for both
linear and multiple
regression equations. The only difference is in the number of
inputs used. The multiple
regression equation general form is:
Y = Intercept + coefficient1 * variable1 + coefficient2 *
variable2 + etc. or
Y = A + b1*X1 + b2*X2 + …; where A is the intercept value, b
is a coefficient value, X is the name of a variable, and the subscripts identify different
variables.
Summary
This week we changed focus from examining differences to
examining relationships –
how variables might move in predictable patterns. This, we
found, can be done with either
correlations or regression equations.
Correlations measure both the strength (the value of the
correlation) and the direction (the
sign) of the relationship. We looked at the Pearson Correlation
(for interval and ratio level data)
and the Spearman’s Rank Order Correlation (for ordinal level
data). Both range from -1.00 (a
perfect inverse correlation where, as one value increases, the other decreases) to +1.00 (a perfect direct correlation where both values increase together). A
perfect correlation means the data
points would fall on a straight line if graphed. One interesting
characteristic of these correlations
occurs when you square the values. This produces the
Coefficient of Determination (CD), which
gives us an estimate of how much variation is in common
between the two variables. CD values
of less than .50 are not particularly useful for practical
purposes.
Regression equations provide a formula that shows us how
much influence an input
variable has on the output; that is, how much the output changes
for a given change in an input.
Regression equations are behind commonly used information such as the relationship
between height and weight for children that doctors use to
assess our children’s development.
That would be a linear regression: Weight = constant +
coefficient*height in inches or Y = A +
b*X, where Y stands for weight, A is the constant, b is the
coefficient, and X is the height. A
multiple regression is conceptually the same but has several
inputs impacting a single output.
If you have any questions on this material, please ask your
instructor.
After finishing this lecture, please go to the first
discussion for the week, and engage
in a discussion with others in the class over the first couple of
days before reading the second
lecture.
