The first part of the output, Within-Subjects Factors, confirms that we set up the repeated measures successfully. Here we expect, and have, the five quizzes listed as dependent variables. In your assignment you should see each of the four tutors listed (if not, do not pass go; go back and restart the SPSS specification as detailed in Section 1 of this tutorial).
The Descriptive Statistics portion of the output provides the
mean and standard deviation for each of the five
quizzes. This information is needed for the APA table (see
Sample APA Table section of this tutorial).
Within-Subjects Factors
Measure: MEASURE_1
quizzes   Dependent Variable
1         quiz1
2         quiz2
3         quiz3
4         quiz4
5         quiz5
The sphericity assumption was violated, Mauchly's W(9, N = 105) = 93.85, p < .001. So, in the Tests of Within-Subjects Effects, we cannot use the sphericity assumed results. Instead, for purposes of this assignment, we need to choose between the Greenhouse-Geisser adjusted results and the Huynh-Feldt adjusted results. In this case we would not draw different conclusions using either, but as a general rule, use Greenhouse-Geisser if its epsilon value is less than .75; otherwise use Huynh-Feldt.
Mauchly's Test of Sphericity (a)
Measure: MEASURE_1
                                                        Epsilon (b)
Within Subjects   Mauchly's   Approx.      df   Sig.    Greenhouse-   Huynh-Feldt   Lower-bound
Effect            W           Chi-Square                Geisser
quizzes           .400        93.851       9    .000    .640          .657          .250
Tests the null hypothesis that the error covariance matrix of the orthonormalized transformed dependent variables is proportional to an identity matrix.
a. Design: Intercept
   Within Subjects Design: quizzes
b. May be used to adjust the degrees of freedom for the averaged tests of significance. Corrected tests are displayed in the Tests of Within-Subjects Effects table.
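For readers curious where Mauchly's W and its chi-square approximation come from, the following Python sketch computes them from the sample covariance matrix of the repeated measures. This is my own illustration, not SPSS output; the array name scores and the function name mauchly are assumptions, with the data taken to sit in an n x 5 array, one row per student.

    import numpy as np
    from scipy.linalg import null_space
    from scipy.stats import chi2

    def mauchly(scores):
        # scores: n x p array, one row per subject, one column per repeated measure
        n, p = scores.shape
        S = np.cov(scores, rowvar=False)     # p x p sample covariance matrix
        C = null_space(np.ones((1, p))).T    # (p - 1) x p orthonormal contrasts
        T = C @ S @ C.T                      # covariance of the transformed variables
        W = np.linalg.det(T) / (np.trace(T) / (p - 1)) ** (p - 1)
        df = p * (p - 1) // 2 - 1
        # Chi-square approximation to the sampling distribution of W
        multiplier = (n - 1) - (2 * (p - 1) ** 2 + (p - 1) + 2) / (6 * (p - 1))
        chi_sq = -multiplier * np.log(W)
        return W, chi_sq, df, chi2.sf(chi_sq, df)

With the tutorial data this should reproduce, within rounding, W = .400, chi-square = 93.851, df = 9, and p < .001.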
Tests of Within-Subjects Effects
The Greenhouse-Geisser adjusted test of mean differences across the five repeated quiz measures was statistically significant, F(2.56, 266.10) = 3.049, p = .037, ηp² = η² = .028, ω² = .015 (a small effect).
Tests of Within-Subjects Effects
Measure: MEASURE_1
Source                          Type III Sum   df      Mean Square   F       Sig.   Partial Eta
                                of Squares                                          Squared
quizzes   Sphericity Assumed    18.819         4       4.705         3.049   .017   .028
          Greenhouse-Geisser    18.819         2.559   7.355         3.049   .037   .028
          Huynh-Feldt           18.819         2.629   7.159         3.049   .035   .028
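For reference, the partial eta squared in this table can be reproduced by hand from the sums of squares (the error sum of squares, 641.981, appears in the fuller version of this table later in this tutorial): partial eta squared = SS quizzes ÷ (SS quizzes + SS error) = 18.819 ÷ (18.819 + 641.981) = .028.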
Crowd conversation noise (quiz1) was associated with the lowest of the five quiz scores. Orchestra music (quiz2) and pop music (quiz3) were associated with the highest scores. Radio shock news (quiz4) and jackhammers (quiz5) were associated with scores higher than those for crowd noise and somewhat lower than those for the two music conditions. Had this been real research, one would have selected the noise conditions with some a priori theoretical explanation for expected differences and interpreted the actual result in light of the theoretical expectations (I leave such to your scientific imagination).
Visual depictions can be misleading, which is why we rely on
statistical tests to determine which quiz means
were different from the others. Tests of pairwise comparisons
are part of the Estimated Marginal Means (EMM)
portion of the output. The first two parts of the EMM output
provide mean, standard error, and 95% confidence
intervals for the mean. These are useful for reference, but the
meat (or tofu) is in the output labeled Pairwise
Comparisons (see next page).
The pairwise comparison of quiz 1 with quiz 2 is the same as the pairwise comparison of quiz 2 with quiz 1. You will avoid redundant results if you consider only the quiz numbers in the 2nd column that are numbered higher than the quiz number in the 1st column. For example, of the four rows of information associated with 1st column quiz 3, only consider the rows for quizzes 4 and 5.
In this example, only two pairs of quizzes statistically significantly differed using Bonferroni adjusted p values. The crowd conversing condition (quiz1) had a lower mean (MD = -0.514, p = .049) than the orchestra music condition (quiz2), and a lower mean (MD = -0.514, p = .001) than the pop music condition (quiz3).
Notice in the table that there are significance values of 1.000. Just as p cannot equal .000, it cannot equal 1.000. In such cases, report as p > .999.
Also notice that quiz2 and quiz3 had the same mean and the same mean difference from quiz1, but one had a p value of .049 and the other .001. If curious, see the “For the Inquisitive…” section of this tutorial.
Pairwise Comparisons
Measure: MEASURE_1
(I) quizzes   (J) quizzes   Mean Difference (I-J)   Std. Error   Sig. (b)   95% Confidence Interval for Difference (b)
                                                                            Lower Bound   Upper Bound
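If you want to replicate these Bonferroni-adjusted comparisons outside SPSS, one approach is a paired t test for each pair, with the p value multiplied by the number of comparisons (10 for five quizzes) and capped at 1. The following is a minimal Python sketch, under the assumption that the quiz scores live in a pandas DataFrame named df with columns quiz1 through quiz5 (my naming, not part of the assignment):

    from itertools import combinations
    from scipy.stats import ttest_rel

    quiz_cols = ["quiz1", "quiz2", "quiz3", "quiz4", "quiz5"]
    n_pairs = 10  # number of unique pairs among five repeated measures

    for a, b in combinations(quiz_cols, 2):
        t, p = ttest_rel(df[a], df[b])      # paired-samples t test
        p_bonf = min(p * n_pairs, 1.0)      # Bonferroni adjustment, capped at 1
        print(a, b, round((df[a] - df[b]).mean(), 3), round(p_bonf, 3))

For a one-way design like this one, the adjusted p values should agree with the Sig. column SPSS reports, within rounding.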
However, suppose quiz1 was taken after drinking a glass of water (the control condition), quiz2 after 1 cup of
water (the control condition), quiz2 after 1 cup of
coffee, quiz3 after 2 cups of coffee, quiz4 after 3 cups of
coffee, and quiz5 after 4 cups of coffee. The
nonsignificant result for the linear contrast, p = .091, would
indicate that as coffee increases linearly there is not
a corresponding linear improvement in quiz score. The
significant quadratic result, p = .006, would indicate that
as coffee increases linearly, quiz results increase to a plateau
then decrease—a curvilinear effect that can be
visually seen in the profile plot. So, in this scenario of the quiz
conditions, coffee helps up to a point, then hurts
quiz performance.
Tests of Within-Subjects Contrasts
Measure: MEASURE_1
Source           quizzes     Type III Sum   df    Mean Square   F       Sig.   Partial Eta
                             of Squares                                        Squared
quizzes          Linear      4.024          1     4.024         2.917   .091   .027
                 Quadratic   8.686          1     8.686         7.858   .006   .070
                 Cubic       6.095          1     6.095         2.323   .131   .022
                 Order 4     .014           1     .014          .013    .910   .000
Error(quizzes)   Linear      143.476        104   1.380
                 Quadratic   114.956        104   1.105
                 Cubic       272.905        104   2.624
                 Order 4     110.644        104   1.064
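To see what a single-df trend contrast is doing, the following Python sketch (my own illustration, not SPSS output) computes the linear and quadratic contrasts for five equally spaced conditions, assuming the quiz scores sit in an n x 5 NumPy array named scores. Each subject gets one contrast score (a weighted sum of their five quizzes), that score is tested against zero with a one-sample t test, and F(1, N - 1) is simply t squared.

    import numpy as np
    from scipy.stats import ttest_1samp

    # Orthogonal polynomial contrast weights for 5 equally spaced conditions
    linear = np.array([-2, -1, 0, 1, 2])
    quadratic = np.array([2, -1, -2, -1, 2])

    def trend_test(scores, weights):
        contrast = scores @ weights          # one contrast score per subject
        t, p = ttest_1samp(contrast, 0.0)    # test the mean contrast against zero
        return t ** 2, p                     # F(1, N - 1) = t squared; p is unchanged

With the tutorial data, the linear and quadratic calls should echo F = 2.917, p = .091 and F = 7.858, p = .006, within rounding.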
Tests of Between-Subjects Effects
For a one-way within-subjects repeated measures ANOVA, there is no between-subjects effect because there is no grouping factor. Nonetheless, a between-subjects test output is produced, but it is simply a test of the intercept and of no importance or value. Ignore it.
Tests of Between-Subjects Effects
Measure: MEASURE_1
Transformed Variable: Average
Source      Type III Sum   df    Mean Square   F          Sig.   Partial Eta
            of Squares                                           Squared
Intercept   32097.190      1     32097.190     1974.033   .000   .950
Error       1691.010       104   16.260
Begin the write-up by describing the context of the research and
the variables. If known, state how each variable
was operationalized, for example: “Overall GPA was measured
on the traditional 4-point scale from 0 (F) to 4
(A)”, or “Satisfaction was measured on a 5-point Likert-type
scale from 1 (not at all satisfied) to 5 (extremely
satisfied).” Please pay attention to APA style for reporting scale
anchors (see p. 91 and p. 105 in the 6th edition
of the APA Manual).
Report descriptive statistics such as minimum, maximum, mean,
and standard deviation for each metric
variable. For nominal variables, report percentage for each level
of the variable, for example: “Of the total
sample (N = 150) there were 40 (26.7%) males and 110 (73.3%)
females.” Keep in mind that a sentence that
includes information in parentheticals must still be a sentence
(and make sense) if the parentheticals are
removed. For example: “Of the total sample there were 40 males
and 110 females.”
State the purpose of the analysis or provide the guiding research
question(s). If you use research questions, do
not craft them such that they can be answered with a yes or no.
Instead, craft them so that they will have a
quantitative answer. For example: “What is the strength and
direction of relationship between X and Y?” or
“What is the difference in group means on X between males and
females?”
Present null and alternative hypothesis sets applicable to the
analysis. For repeated measures ANOVA there
would be a hypothesis set for the main effect of the within-subjects factor (i.e., mean differences among the
repeated measures).
Redux
If the variances of the repeated measures are equal, and if the covariances (and, thus, correlations) of each pair of repeated measures are equal, then there is compound symmetry and, as a result, sphericity is satisfied. When sphericity is violated (p < .05), the F test is too liberal (increased Type I error), making it more likely to conclude statistical significance incorrectly. In the example output the sphericity assumption was violated, Mauchly’s W(9, N = 105) = 93.85, p < .001.
To get a sense of why sphericity was not satisfied, we can examine the variances (or standard deviations) and pairwise correlations of the repeated measures.
From the Descriptive Statistics output we see that the standard deviations ranged from 1.623 to 2.481. Thus, the variances, being the squares of the standard deviations, ranged from 2.634 to 6.155, which, on their face, seem far from being relatively equal.
From the correlation matrix we see that the pairwise correlations ranged from .445 (quiz4 with quiz5) to .858 (quiz1 with quiz3). The standard deviations of quiz1, quiz3, and quiz4 appear relatively equal, and so do their correlations: .858, .829, and .796. I would hypothesize that if just these three were analyzed in a repeated measures ANOVA, sphericity would be satisfied. I tested the hypothesis and sphericity was satisfied (see output below), Mauchly’s W(2, N = 105) = 2.343, p = .310.
Notice that the Greenhouse-Geisser and Huynh-Feldt epsilon values were .978 and .997, respectively. The maximum possible value is 1.0, indicating perfect symmetry.
Mauchly's Test of Sphericity (a)
Measure: MEASURE_1
                                                        Epsilon (b)
Within Subjects   Mauchly's   Approx.      df   Sig.    Greenhouse-   Huynh-Feldt   Lower-bound
Effect            W           Chi-Square                Geisser
threequizzes      .978        2.343        2    .310    .978          .997          .500
Descriptive Statistics
Mean Std. Deviation N
quiz1 7.47 2.481 105
Mauchly's Test of Sphericity
Measure: MEASURE_1
                                                        Epsilon (b)
Within Subjects   Mauchly's   Approx.      df   Sig.    Greenhouse-   Huynh-Feldt   Lower-bound
Effect            W           Chi-Square                Geisser
quizzes           .400        93.851       9    .000    .640          .657          .250
Before further examination of Greenhouse-Geisser and Huynh-Feldt, I want to return to a technical point about sphericity itself. Previously, I stated that if there was compound symmetry (equal variances and covariances), sphericity would be satisfied. Compound symmetry is a sufficient but not necessary condition for sphericity. Even if there is not compound symmetry, sphericity, which is what is technically tested, is satisfied if the pairwise differences between the repeated measures have equal variances.
With five repeated measures, there would be 10 such pairwise differences (quiz1 minus quiz2, quiz1 minus quiz3, etc.). Using syntax compute commands I actually calculated the 10 pairwise difference variables. The variances of these 10 variables ranged from 1.656 to 4.858, indicating unequal variances, as expected.
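The same check is easy to do outside SPSS. A minimal Python sketch, assuming the quiz scores are the columns of an n x 5 NumPy array named scores (my naming, not part of the assignment):

    import numpy as np
    from itertools import combinations

    # Variance of each of the 10 pairwise difference variables (quiz i minus quiz j)
    variances = [np.var(scores[:, i] - scores[:, j], ddof=1)
                 for i, j in combinations(range(scores.shape[1]), 2)]

    print(min(variances), max(variances))

With the tutorial data, the printed minimum and maximum should be roughly 1.656 and 4.858, the same range reported above.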
Back to epsilon and tests of within-subjects effects.
Greenhouse-
Geisser and Huynh-Feldt are adjustments to the df value for the
repeated measures effect (here, the effect of the various quiz
environmental conditions) and the df error value.
Tests of Within-Subjects Effects
Measure: MEASURE_1
Source                           Type III Sum   df        Mean Square   F       Sig.   Partial Eta   Observed
                                 of Squares                                            Squared       Power (a)
quizzes   Sphericity Assumed     18.819         4         4.705         3.049   .017   .028          .805
          Greenhouse-Geisser     18.819         2.559     7.355         3.049   .037   .028          .662
          Huynh-Feldt            18.819         2.629     7.159         3.049   .035   .028          .670
          Lower-bound            18.819         1.000     18.819        3.049   .084   .028          .409
Error(quizzes)
          Sphericity Assumed     641.981        416       1.543
          Greenhouse-Geisser     641.981        266.100   2.413
With sphericity assumed, the effect df = 4 for quizzes (i.e., df = number of repeated measures – 1), and df error = 416 (N – 1 times number of repeated measures – 1 = [105 – 1] x [5 – 1] = 104 x 4 = 416). The Greenhouse-Geisser correction value (i.e., epsilon value) was .640. The sphericity assumed effect df times .640 is the corrected effect df for Greenhouse-Geisser (4 x .640 = 2.56, within rounding error of the output value). Similarly, the sphericity assumed error df times .640 is the corrected error df for Greenhouse-Geisser (416 x .640 = 266.24, within rounding error of the output value). The Huynh-Feldt df effect and error adjustments are calculated the same way, but using the epsilon value of .657.
Because the same adjustment (multiplication by a constant
value) is made to both the effect df and the error df,
the F value and partial eta squared are unchanged. What differs
is that significance of the F value is tested using
different df values, so p will not be the same. In this example,
the F value is 3.049. When evaluated at 4 and 416
df (sphericity assumed), p = .017; but when evaluated at 2.559
and 266.100 (Greenhouse-Geisser), p = .037; and
for 2.629 and 273.385 (Huynh-Feldt), p = .035.
Notice that the p values get larger as the epsilon adjustment value gets smaller. This helps to avoid the increased risk of Type I error when sphericity is violated. It also, however, decreases power. In the Tests of Within-Subjects Effects table above, I included a power column. Power is highest with sphericity assumed (in this example .805) and decreased to .670 for Huynh-Feldt and to .662 for Greenhouse-Geisser (which had the lower epsilon value).
It should be clear that there can be statistical significance (p < .05) with sphericity assumed, but the effect may not be statistically significant when sphericity is violated and F test adjustments are made.
Greenhouse-Geisser may underestimate epsilon, resulting in too much correction, and Huynh-Feldt may overestimate epsilon, resulting in not enough correction. As a general rule, use Greenhouse-Geisser if its epsilon value is less than .75; otherwise use Huynh-Feldt. You can also average the two adjustments by taking the average of the two p values, even though this is not technically correct. Technically, you average the two epsilon values and compute new adjusted effect df and error df values. In this example, the average of the .640 and .657 epsilon values is .6485. The adjusted effect df would be 4 x .6485 = 2.594, and the adjusted error df would be 416 x .6485 = 269.776. Unfortunately, you cannot correctly compute the p value using Excel's FDIST function because it truncates the df values. Also, I am not aware of any online calculator that works with decimal df values.
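That said, statistical libraries generally accept decimal degrees of freedom. As an aside (my own sketch, not part of the tutorial's SPSS workflow), the p value for the averaged-epsilon adjustment can be obtained with scipy:

    from scipy.stats import f

    F_value = 3.049
    df_effect = 4 * 0.6485     # = 2.594, averaged-epsilon adjusted effect df
    df_error = 416 * 0.6485    # = 269.776, averaged-epsilon adjusted error df

    p = f.sf(F_value, df_effect, df_error)   # upper-tail probability of the F distribution
    print(round(p, 3))

The result should fall roughly midway between the Greenhouse-Geisser (.037) and Huynh-Feldt (.035) p values.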
Finally, recall that quiz2 and quiz3 had equal means and equal mean differences from quiz1, but the quiz1:quiz2 pairwise p = .049 and the quiz1:quiz3 pairwise p = .001. Why the difference? In a nutshell, the quiz1:quiz3 difference had a smaller standard error and, all else equal, p decreases as standard error decreases. More precisely, each pairwise comparison constitutes a t test, where t = mean difference ÷ standard error.
Pairwise Comparisons
Measure: MEASURE_1
If we create a new variable, q1minusq2, by subtracting the quiz2 scores from the quiz1 scores, and do similar to create q1minusq3, we can look at the descriptive statistics for each of the two newly created variables.
The standard error (SE) is a function of the standard deviation (SD) and the sample size (N), such that SE = SD ÷ √N. Because N = 105 for both variables, the issue boils down to differences in the standard deviation. For the same mean difference and N, the variable with the smaller standard deviation, in this case q1minusq3, will have a larger t value (in absolute value) and a smaller p value.
Descriptive Statistics
                     N           Mean                      Std. Deviation   Variance
                     Statistic   Statistic   Std. Error    Statistic        Statistic
q1minusq2            105         -.5143      .17909        1.83510          3.368
q1minusq3            105         -.5143      .12559        1.28687          1.656
Valid N (listwise)   105
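To make the arithmetic concrete, the following Python sketch (my own check, using only the values in the table above) recovers each pairwise t from the mean difference and standard error and then applies the Bonferroni adjustment across the 10 comparisons:

    from scipy.stats import t as t_dist

    n, n_pairs = 105, 10

    def bonferroni_p(mean_diff, sd):
        se = sd / n ** 0.5                       # SE = SD / sqrt(N)
        t_value = mean_diff / se                 # paired t statistic
        p = 2 * t_dist.sf(abs(t_value), n - 1)   # two-tailed p, df = N - 1
        return min(p * n_pairs, 1.0)             # Bonferroni adjustment

    print(bonferroni_p(-0.5143, 1.83510))   # quiz1 vs quiz2: about .049
    print(bonferroni_p(-0.5143, 1.28687))   # quiz1 vs quiz3: about .001

The smaller standard deviation of q1minusq3 is what drives its much smaller adjusted p value.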
Conceptually, the differences between the quiz1 and quiz3 scores were more homogeneous (less spread out), and the quiz1 and quiz3 scores themselves were more highly correlated (r[103] = .858) than the quiz1 and quiz2 scores (r[103] = .673). This is visually apparent in the scatterplots below: the quiz1:quiz2 plot on the left is more scattered than the quiz1:quiz3 plot on the right.
So, the mystery of why quiz2 and quiz3 had equal mean and
equal mean difference from quiz1, but the