Week 4 Lecture 10
We have been examining the question of equal pay for equal work for several weeks now but have been somewhat frustrated with the equal work part. We suspect that salary varies with grade level, so that comparing salaries across grades does not compare equal work. We found that we could control the effect of grades with either of two techniques. The first is choosing a variable that does not include grade level variation, such as the compa-ratio (the salary divided by the grade midpoint). The second is statistically removing the impact of grade level using the ANOVA two-factor without replication. Both of these gave us different outcomes on the question of male and female pay equality than examining salary alone.
However, we still have not gotten a “clean” measure of equal work, as there are still other factors that may impact the work done, such as performance levels (measured by the performance appraisal rating), seniority, education, etc. And there could be gender bias (and, for real-world companies, ethnic bias as well; we will not cover this, but it can be dealt with the same way we will examine gender). We need to find a way to eliminate the impact of these variables on our pay measure as well.
This week we will look at two techniques that are very good at examining and explaining the influence of variables on outcomes. These are correlation and regression techniques.

Linear Correlation
Correlation is a measure of how variables/things relate – that is, if one variable changes, does another variable change in a predictable pattern as well? One very well-known example is the correlation (or relationship) between the length/height of children and their weight. As children become longer/taller, their weight also increases (Tanner & Youssef-Morgan, 2013). Using this relationship, we can make predictions (using the technique of regression discussed in Lecture 11 for this week) about how heavy a child should be for any given height.
For variables that are at least interval in nature, two types of correlation exist for a bivariable (two variables only) relationship: linear and curvilinear. As they sound, linear correlations show the extent to which the data variables move in a straight line. Curvilinear correlations – which we will not cover – show the extent to which variables move in curved lines.
Scatter Diagrams
An effective way to see if the data do relate in predictable ways involves generating a scatter diagram (AKA scatter chart) – a visual display of how the data points (variable 1 value, corresponding variable 2 value) relate together (Lind, Marchal, & Wathen, 2008).
Example 1. One relationship we might expect to show a positive (both values increasing) relationship would be salary and performance rating, either for the entire salary range or at least within grades. The following scatter diagram (made with the Excel Insert chart functions) shows the relationship, with Performance Rating on the horizontal axis and Salary on the vertical axis. It shows that if we put a straight line through the data points, there is a very modest increase from the lower left to the upper right.
Salary (Y-axis) and Performance Rating (X-axis)
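For anyone who wants to reproduce this kind of chart outside Excel, here is a minimal sketch in Python using matplotlib; the (rating, salary) pairs below are invented for illustration and are not the course data set.

    import matplotlib.pyplot as plt

    # Hypothetical (performance rating, salary in $000s) pairs, for illustration only
    ratings  = [35, 50, 55, 65, 70, 75, 80, 85, 90, 95]
    salaries = [24, 28, 41, 35, 52, 48, 60, 55, 67, 74]

    plt.scatter(ratings, salaries)        # one point per (rating, salary) pair
    plt.xlabel("Performance Rating")      # horizontal (X) axis
    plt.ylabel("Salary ($000s)")          # vertical (Y) axis
    plt.title("Salary vs. Performance Rating")
    plt.show()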
Example 2. If we look at the same variables but include Grade as a factor, we get the second graph (below) and see the data separated by grade. Each grade seems to show (again, if we were to put a straight line through the data points for each grade) level lines, indicating no correlation at all. Neither graph gives us much hope that Performance Rating is related to Salary, something HR would probably not be happy with.

Salary Grades (Y-axis) and Performance Appraisal Rating (X-axis)
Correlation
We will be focusing our efforts on the Pearson Correlation Coefficient – a mathematical value that shows the strength of the linear (straight line) relationship between two variables (Lind, Marchal, & Wathen, 2008). The math formula is a bit tedious, so we will not bother with it – but, if interested, you can ask Excel to display it (either with Help or the “Tell me what you want to do” box. With the latter, I typed show help on Pearson Correlation and then selected the “show help…” line, getting a description and the math formula).
The Pearson correlation ranges from -1.00 to +1.00. Any value outside of this range indicates an error in the math or setup. A perfect negative correlation (-1.00) means that the data points all fit exactly on a line that runs from the upper left corner of a graph to the lower right – a negative slope. A perfect positive correlation (+1.00) has a line with a positive slope, running from the lower left to the upper right (Tanner & Youssef-Morgan, 2013).
As the values move away from these perfect extremes, the data points spread out around the line. If we look at our first graph above, the overall Salary and Performance Rating relationship, we have a correlation of +.15, considered very low and not particularly impressive.
Pearson Correlation. Excel finds the Pearson Correlation Coefficient using either the fx function CORREL or the Data Analysis function Correlation. The former is used for a single data set with two variables, while the latter can be used for single or multiple data sets. The output for the Performance Rating and Salary correlation is:

              Column 1    Column 2
    Column 1  1
    Column 2  0.151307    1
Note that the variable names are not included, and we have three correlations. Two will always be a perfect +1.00: the correlation of column 1 with column 1 and of column 2 with column 2; this diagonal convention makes more sense with the Correlation table we will look at below. The third correlation is column 1 with column 2. It does not matter which variable is placed in column 1 or column 2, as switching the variable columns gives the same result.
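If you want to check Excel’s result outside the spreadsheet, here is a minimal sketch using Python’s numpy; the data values are made up for illustration. The corrcoef function returns the same square layout as the Excel output above: 1.0 on the diagonal and the Pearson r off the diagonal.

    import numpy as np

    # Illustrative made-up data; any two equal-length numeric lists work
    perf_rating = [35, 50, 55, 65, 70, 75, 80, 85, 90, 95]
    salary      = [24, 28, 41, 35, 52, 48, 60, 55, 67, 74]

    # np.corrcoef returns a 2 x 2 matrix: 1.0 on the diagonal,
    # the Pearson r in the off-diagonal cells
    r_matrix = np.corrcoef(perf_rating, salary)
    print(r_matrix[0, 1])   # the single correlation of interest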
We can use the Correlation function to identify correlations between multiple data sets at the same time, much as Descriptive Statistics could work with multiple variables at once. In trying to identify what variables might be impacting Salary, we could generate the following table. Remember that Pearson’s Correlation requires at least interval level data, so not all of our variables are used. In addition, since Salary and Compa-ratio are two measures of the same thing (pay), we do not want to include them in the same table.
              Sal      Mid      Age      Perf Rat  Service  Raise
    Sal       1.000
    Mid       0.989    1.000
    Age       0.544    0.567    1.000
    ...

Reading across the Age row and then down the Age column gives Age’s correlation with each variable: Sal = 0.544, Mid = 0.567, Age (itself) = 1.00, Perf Rat = 0.139, Service = 0.565, and Raise = -0.180.
Side note: now we can see why the correlation of each variable with itself is shown in the tables – it provides the pivot point for reading the table outcomes. The values above this diagonal of 1.00 values would be identical to those below, so they are not provided, making the table visually easier to read.
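A multi-variable table like the one above can also be sketched in Python with pandas; the column names and values below are stand-ins, not the actual course data. DataFrame.corr() computes every pairwise Pearson correlation at once, producing the full symmetric matrix rather than just the lower triangle Excel displays.

    import pandas as pd

    # Hypothetical stand-in for the course data set; column names assumed
    df = pd.DataFrame({
        "Sal":      [24, 28, 41, 35, 52, 48, 60, 55, 67, 74],
        "Mid":      [23, 23, 40, 31, 48, 48, 57, 57, 67, 67],
        "Age":      [32, 30, 44, 36, 48, 45, 52, 49, 56, 58],
        "Perf Rat": [85, 80, 90, 75, 88, 70, 95, 65, 90, 85],
    })

    # Every pairwise Pearson correlation in one call
    print(df.corr().round(3))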
Coefficient of Determination. We will look at determining the statistical significance of correlations in Lecture 12 for this week. But, in the meantime, we can consider the coefficient of determination as a rough measure of usefulness (we will look at the effect size measure in Lecture 12 as well). The coefficient of determination is the square of the correlation and represents the percent of variation that the variables share in common; that is, the amount of variation in one variable’s changes that is explained by the variation in the other variable. So, for Age and Salary, the coefficient equals 0.544^2 = .30 (rounded). As a rule of thumb, variable pairs sharing less than (<) 70% of their variation are generally not very valuable for prediction purposes.
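As a quick check of the arithmetic, a couple of lines of Python reproduce the Age and Salary calculation above:

    # Coefficient of determination: the square of the correlation
    r = 0.544            # Age-Salary correlation from the table above
    r_squared = r ** 2   # about 0.296, i.e., roughly 30% shared variation
    print(round(r_squared, 2))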
References

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical Techniques in Business & Economics (13th ed.). Boston: McGraw-Hill Irwin.

Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for Managers. San Diego, CA: Bridgepoint Education.
Week 4 Lecture 12
Significance
Earlier we discussed correlations without going into how we can identify statistically significant values. Our approach to this uses the t-test. Unfortunately, Excel does not automatically produce this form of the t-test, but setting it up within an Excel cell is fairly easy. And, with some slight algebra, we can determine the minimum value that is statistically significant for any table of correlations, all of which have the same number of pairs (for example, a Correlation table for our data set would use 50 pairs of values, since we have 50 members in our sample).
The t-test formula for a correlation (r) is t = r * sqrt(n-2) / sqrt(1 - r^2); the associated degrees of freedom are n - 2 (number of pairs minus 2) (Lind, Marchal, & Wathen, 2008). For some this might look a bit off-putting, but remember that we can translate this into Excel cells and functions and have Excel do the arithmetic for us.
Excel Example

If we go back to our correlation table for Sal, Mid, Age, Perf Rat, Service, and Raise, we have the correlations we need to test. Using Excel to create the formula, with cell references for our key values, allows us to quickly create a result; the T.DIST.2T function then gives us a p-value easily.
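For readers working outside Excel, here is a minimal sketch of the same calculation in Python, using scipy for the two-tailed p-value in the role of T.DIST.2T; the r of 0.544 is the Age-Salary correlation from our table, with n = 50 pairs.

    from scipy import stats

    r, n = 0.544, 50                              # Age-Salary correlation, 50 pairs
    t = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5  # t = r*sqrt(n-2)/sqrt(1-r^2)
    df = n - 2                                    # degrees of freedom
    p = 2 * stats.t.sf(abs(t), df)                # two-tailed p-value, like T.DIST.2T
    print(round(t, 3), p)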
The formula to use in finding the minimum correlation value that is statistically significant is r = sqrt(t^2 / (t^2 + n - 2)). We would find the appropriate t value by using T.INV.2T(alpha, df) with alpha = 0.05 and df = n - 2, or 48. Plugging these values into the function gives us a t-value of 2.0106, or 2.011 (rounded).

Putting 2.011 and 48 (n - 2) into our formula gives us an r value of 0.278; therefore, in a correlation table based on 50 pairs, any correlation greater than or equal to 0.278 would be statistically significant.
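The same minimum-r calculation can be sketched in Python, with scipy’s t.ppf playing the role of Excel’s T.INV.2T:

    from scipy import stats

    n, alpha = 50, 0.05
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)            # matches T.INV.2T(0.05, 48)
    r_min = (t_crit ** 2 / (t_crit ** 2 + n - 2)) ** 0.5  # r = sqrt(t^2/(t^2 + n - 2))
    print(round(t_crit, 4), round(r_min, 3))              # about 2.0106 and roughly 0.28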
Technical Point. If you are interested in how we obtained the formula for determining the minimum r value, the approach is shown below. If you are not interested in the math, you can safely skip this paragraph.

Start with: t = r * sqrt(n-2) / sqrt(1 - r^2)
Multiplying gives us: t * sqrt(1 - r^2) = r * sqrt(n-2)
Squaring gives us: t^2 * (1 - r^2) = r^2 * (n-2)
Multiplying out gives us: t^2 - t^2 * r^2 = n * r^2 - 2 * r^2
Adding gives us: t^2 = n * r^2 - 2 * r^2 + t^2 * r^2
Factoring gives us: t^2 = r^2 * (n - 2 + t^2)
Dividing gives us: t^2 / (n - 2 + t^2) = r^2
Taking the square root gives us: r = sqrt(t^2 / (n - 2 + t^2))

Effect Size Measures
As we have discussed, there is a difference between statistical and practical significance. Virtually any statistic can become statistically significant if the sample is large enough. In practical terms, a correlation of .30 or below is generally considered too weak to be of any practical significance. Additionally, the effect size measure for Pearson’s correlation is simply the absolute value of the correlation; the outcome has the same general interpretation as Cohen’s d for the t-test (0.8 is strong and 0.2 is quite weak, for example) (Tanner & Youssef-Morgan, 2013).

Spearman’s Rank Correlation
Another type of correlation is Spearman’s rank order correlation, rho, which is interpreted the same way as the Pearson Correlation but can be performed on ordinal or any ranked data (Tanner & Youssef-Morgan, 2013). Using the same data, but assuming at least one variable is ordinal, would give us the following results. Note that in ranking from low to high, tied values are each given the average of the ranks they occupy. For example, in the example below the raise of 4.7 occurs twice (the 3rd and 4th places), so each gets a rank of 3.5.
[Ranking table: Raise and Performance Rating ranks for the 10 employees; sum of squared rank differences = 79]
Spearman’s rank order correlation = 1 - 6 * (sum of squared differences) / (n * (n^2 - 1))

For this data, the sum of squared differences = 79 and n = 10. This gives us a value of 1 - 6 * (79 / (10 * (10^2 - 1))) = 1 - 6 * (79 / 990) = 1 - 6 * 0.08 = 0.52.
For comparison purposes, the Pearson Correlation equals 0.686. Note that we have less information about the data when we use ranks, particularly with several ties in the data. This reduced information results in a lower correlation value with Spearman’s. This correlation is tested and interpreted the same way as Pearson’s coefficient is (Lind, Marchal, & Wathen, 2008).
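Both correlations can be computed in Python with scipy; the ten (raise, rating) pairs below are made-up stand-ins for the example data, chosen to include a tie so the average-rank handling is visible.

    from scipy import stats

    # Hypothetical data for ten employees; the two 4.7 raises tie for
    # 3rd and 4th place, so each receives the average rank of 3.5
    raise_pct = [3.0, 4.0, 4.7, 4.7, 5.0, 5.5, 5.7, 6.0, 6.3, 6.6]
    perf_rat  = [55, 75, 80, 70, 85, 65, 90, 95, 92, 98]

    rho, _ = stats.spearmanr(raise_pct, perf_rat)  # rank-based; ties get average ranks
    r, _   = stats.pearsonr(raise_pct, perf_rat)   # Pearson on the raw values
    print(round(rho, 3), round(r, 3))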
References

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical Techniques in Business & Economics (13th ed.). Boston: McGraw-Hill Irwin.

Tanner, D. E., & Youssef-Morgan, C. M. (2013). Statistics for Managers. San Diego, CA: Bridgepoint Education.
Week 4 Lecture 11
Regression Analysis
Regression analysis is the development of an equation that shows the impact of the independent variables (the inputs we can generally control) on the output result. While the mathematical language may sound strange, most of you are quite familiar with regression-like instructions and use them quite regularly.
To make a cake, we take 1 box mix, add 1¼ cups of water, ½ cup of oil, and 3 eggs. All of this is combined and cooked. The recipe is an example of a regression equation. The output (or result, or dependent variable) is the cake; the inputs (or independent variables) are the ingredients used. Each input is accompanied by a coefficient (AKA weight or amount) that tells us how “much” of the variable is “used” or weighted into the outcome.
So, in an equation format, this cake recipe might look like:

Y = 1*X1 + 1.25*X2 + .5*X3 + 3*X4, where:

Y = cake
X1 = box mix
X2 = cups of water
X3 = cups of oil
X4 = eggs

Of course, for the cake, the recipe needs to go through the cooking process; for other regression equations, the inputs need to go through whatever “process” turns the inputs into the output – this is often called “life.”

Example
With a regression analysis, we can identify what factors influence an outcome. So, with our Salary issue, the natural question to help us answer our research question of whether males and females get equal pay for equal work would be: what factors influence or explain an individual’s pay? This is a perfect question for a multi-variate regression. Multi-variate simply means we have multiple input variables with a single output variable (Lind, Marchal, & Wathen, 2008).
Variables. A regression analysis uses two distinct types of data. The first is variables that are at least interval level or better (the same as the other techniques we have used so far). The other is called a dummy variable, a variable that can be coded 0 or 1 to indicate the presence or absence of some characteristic. In our data set, we have two variables that can be used as dummy coded variables in a regression, Degree and Gender, both coded 0 or 1. In the case of Degree, 0 stands for having a bachelor’s degree and 1 stands for having an advanced degree. For Gender, 0 means male and 1 means female. How these are interpreted in a regression output will be discussed below. For now, the significance of dummy coding is that it allows us to include nominal or ordinal data in our analysis.
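If the raw data arrived as text categories instead of 0/1 codes, the dummy coding itself is a one-line mapping. Here is a sketch with hypothetical labels, since our course data set is assumed to come already coded:

    import pandas as pd

    # Hypothetical text categories standing in for raw HR data
    df = pd.DataFrame({"Gender": ["M", "F", "F", "M"],
                       "Degree": ["BA", "Grad", "BA", "Grad"]})

    # Recode each category as a 0/1 dummy variable for the regression
    df["Gender"] = df["Gender"].map({"M": 0, "F": 1})       # 0 = male, 1 = female
    df["Degree"] = df["Degree"].map({"BA": 0, "Grad": 1})   # 0 = bachelor's, 1 = advanced
    print(df)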
Excel Approach. For our question of what factors influence pay, we will use Excel’s Regression function found in the Data Analysis section. This function will produce two output tables of interest. The first table tests to see if the entire regression equation is statistically significant; that is, do the input variables significantly impact the output variable? If so, we would then examine the second table – the coefficients used in a regression equation for each of the variables. We would have a second set of hypothesis statements for each variable: the null would be that the coefficient equals 0, versus an alternate that the coefficient is not equal to 0. Typically, we list these before we start the analysis.
Step 1: For the regression equation:

Ho: The regression equation is not significant.
Ha: The regression equation is significant.

For the coefficients, if the regression equation is significant:

Ho: The regression coefficient equals 0.
Ha: The regression coefficient is not equal to 0.

Note: We would write one pair of statements for each variable; for space reasons, we include only one general statement that should be applied to each variable.
Step 2: Reject each null hypothesis claim if the related p-value is less than (<) alpha = .05.
Step 3: Regression Analysis
Step 4: Perform the test. Selecting the Regression option in Data Analysis will open a familiar data entry box. The Input Y Range would be the salary range, including the label. The Input X Range would be the labels and data for our input variables; in this case we will use Midpoint, Age, Performance Rating, Service, Raise, Degree, and Gender. Be sure to check the Labels box and pick an upper left corner for the output range. This produces the output described below (values rounded to three decimal places): a Regression Statistics table, an ANOVA table, and a coefficients table.
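For those who want to run the same test outside Excel, here is a minimal sketch using Python’s statsmodels; the file name and column names are assumptions standing in for the course data set.

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical file and column names standing in for the course data set
    df = pd.read_excel("salary_data.xlsx")

    X = df[["Midpoint", "Age", "Perf Rat", "Service", "Raise", "Degree", "Gender"]]
    X = sm.add_constant(X)                 # adds the intercept (A) term
    model = sm.OLS(df["Salary"], X).fit()

    # summary() shows R-squared, the overall F test (Prob (F-statistic) is
    # Excel's Significance F), and per-coefficient t statistics and p-values
    print(model.summary())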
Step 5: Conclusions and Interpretation. Let’s look at each table separately.
The Regression Statistics table shows a Multiple R and an R Square value. Multiple R is the multiple correlation value; similar to our Pearson coefficient, it shows the relationship between the dependent variable (the output, or Salary in this case) and all of the independent or input variables. R Square is the multiple coefficient of determination; similar to the Pearson coefficient of determination, it displays the percent of variation in common between the dependent and all of the independent variables.
The Adjusted R Square reduces the R Square by a factor that involves the number of variables and the sample size, suggesting whether the design impacted the outcome more than the variables did; here we have an insignificant reduction. The standard error is a measure of variation in the outcome used for predictions. The observations count shows the number of cases used in the regression.
The ANOVA table, sometimes called ANOR (analysis of regression), provides us with our test of significance outcome. Similar to the ANOVA covered in Week 3, we look at the Significance F value (AKA the p-value) to see if we reject or fail to reject the null hypothesis of no significance. In this case, the p-value of 8.44E-36 (a decimal point followed by 35 zeros and then 844) is less than .05, so we reject the null of no significance. The regression equation explains a significant proportion of the variation in our dependent variable of Salary.
Now that we have a significant regression equation, we move on to the final table, which presents and tests the coefficients for each variable. One of the important parts of a regression equation is that it shows us the impact of each factor with all other factors held constant. A regression has the form:

Y = A + B1*X1 + B2*X2 + B3*X3 + …

where Y is the output, A is the intercept (placing the line up or down on the Y axis when all other values are 0), the B’s are the coefficient values, and the X’s are the variable names. Before considering whether each coefficient is statistically significant or not, our equation would be:

Salary = -4.009 + 1.22*Midpoint + 0.029*Age - 0.096*Perf Rat - 0.074*Service + 0.834*Raise + 1.002*Degree + 2.552*Gender. Whew!
What does this mean? The intercept is an adjustment factor, one that we do not need to analyze. For Midpoint, it means that as the midpoint goes up by a thousand dollars (remember, salary and midpoint are measured in thousands), the salary goes up by 1.22 thousand – higher graded employees are paid relatively more compared to midpoint than others (all other things equal). For Performance Rating, employees lose $96 (-0.096) for every higher PR point they have – certainly not what HR would like!
Now, let’s look at our dummy variables, Degree and Gender. For Degree, an extra $1,002 is added for employees having a Degree code of 1 (if Degree = 0, then 1.002 * 0 = 0); so graduate degree holders get an extra $1,002 per year. The same thing applies to Gender: those coded 0 get nothing extra, and those coded 1 get $2,552 more per year (all other things equal). Since females are coded 1, if this factor is significant, they would be paid $2,552 more than males with all other factors equal (the definition of equal work).
So, now let’s take a look at the statistical significance of each of the variables. This is determined with the P-value column (next to the t Stat value). This is read the same way as in the t-test and ANOVA tables: if the value is less than 0.05, we reject the null hypothesis of no significance.

While the intercept has a significance value, we tend to ignore it and include the intercept in all equations. For the other variables, the only significant ones are Midpoint, Perf Rating (unrounded, it was 0.0497994…), and Gender. So, the regression equation including only our statistically significant factors is:

Sal = -4.009 + 1.22*Midpoint - 0.096*Perf Rat + 2.552*Gender
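To see how the final equation is used, here is a small sketch that turns it into a prediction function; the inputs are illustrative, not cases from the data set.

    def predicted_salary(midpoint, perf_rat, gender):
        """Predicted salary (in $000s) from the significant coefficients;
        gender is the dummy code: 0 = male, 1 = female."""
        return -4.009 + 1.22 * midpoint - 0.096 * perf_rat + 2.552 * gender

    # Illustrative case: midpoint of 48 ($48,000), performance rating of 85
    print(predicted_salary(48, 85, 0))  # male:   about 46.39
    print(predicted_salary(48, 85, 1))  # female: about 48.94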
So, we now have a clear answer to our question about males and females getting equal pay for equal work. Not only is the answer no (as gender is a significant factor in determining salary), but females are paid $2,552 more annually, all other things equal!

This is certainly not the outcome most of us expected when we began this journey. What we see is that variation within any measure has some often unanticipated outcomes, and unless we examine the inputs into our results, we often do not understand them very well. Single-measure tests such as the t and ANOVA tests are quite valuable for comparing similar results, but they do not always get to the root of what causes differences.
Reference

Lind, D. A., Marchal, W. G., & Wathen, S. A. (2008). Statistical Techniques in Business & Economics (13th ed.). Boston: McGraw-Hill Irwin.