Hypothesis Testing
In doing research, one of the most common activities is testing
hypotheses. The Afrobarometer data set below is a survey of
African citizens’ attitudes on democracy, governance, the
economy, and other related topics (www.afrobarometer.org).
Using this data set, you might want to examine hypotheses
related to whether rural and urban citizens differ, on average, in
how much they trust the government. The tables below present
results from an independent samples t-test to examine these
hypotheses using a random sample of 44 participants from the
complete data set. Each respondent’s score is a value between 0
and 15 with a higher score indicating greater trust. You can see
that the mean for the urban group is 7.00 (SD = 4.17) and the
mean for the rural group is 7.74 (SD = 4.38). The observed
value of the t-statistic is -.564 and the p-value equals 0.576 (see
the column labeled “Sig. (2-tailed)”).
                                     t       df    Sig. (2-tailed)   Mean Difference   Std. Error Difference
Trust in Government Index
(higher scores = more trust)       -.564     41         .576             -.73913             1.30978
Group Statistics
Urban or Rural Primary Sampling Unit      N      Mean      Std. Deviation    Std. Error Mean
Trust in Government Index
(higher scores = more trust)
  Urban                                   20     7.000        4.16754            .93189
  Rural                                   30     7.7391       4.38196            .91370
The p-value is the probability of obtaining a value of the test
statistic as extreme as or more extreme than the observed value
(less than -.564 or greater than +.564) if the null hypothesis is
true. You will see in this Skill Builder that the p-value can
easily be used to make statistical decisions in hypothesis
testing. However, while the p-value is important in determining
statistical significance, it does not tell the whole story.
Steps of Hypothesis Testing
To interpret p-values, let's review the key steps in hypothesis
testing.
Step 1
State the null and alternative hypotheses
Recall that hypotheses are statements about population
parameters. For the Trust in Government example from the
Afrobarometer data set, the null and alternative hypotheses are
H0: μurban = μrural and HA: μurban ≠ μrural.
The Greek letter, µ, indicates a population mean, and the
subscripts indicate levels of the independent variable (“urban”
and “rural”). Here the null is saying that the mean for the urban
population on the Trust In Government variable is the same as
the mean for the rural population. The alternative hypothesis
states that these means are not the same.
One-tailed vs. Two-tailed Tests
One important factor to be aware of is whether the test you are
conducting is one-tailed or two-tailed. So far, the hypotheses
have been written for a two-tailed test, which means that the
alternative hypothesis stated simply that there was a difference
between the means, without specifying the direction of the
difference. In a one-tailed test, the alternative hypothesis does
specify the direction of the difference; that is, it specifies that
one of the means (e.g., urban or rural) is expected to be larger
than the other.
In a one-tailed test, the p-value will be the area in the test
statistic distribution to the right of the observed value if the
alternative hypothesis has an “is greater than” sign, and to the
left of the observed value if the alternative hypothesis has an
“is less than” sign. For example, suppose we had the following
hypothesis test: H0: μurban ≤ μrural versus HA: μurban > μrural.
In this hypothesis test, the alternative hypothesis HA states that
the mean for the urban population will be greater than the mean
for the rural population. The p-value would, therefore, be
determined by the area to the right of the observed value of the
test statistic using the sampling distribution for the test
statistic.
For a two-tailed test, as is being illustrated with the
Afrobarometer data file, the area beyond the observed value is
doubled to obtain the p-value. The reason for doubling is related
to setting the rejection region for a two-tailed test. For a two-
tailed test, alpha is divided in half (α/2), and the “half-areas”
are used to identify rejection regions in both the upper and
lower tails of the test statistic’s sampling distribution.
The doubling of the area beyond the observed value allows
the p-value to be compared to alpha to test the null hypothesis.
Figure 1
Figure 1 shows the p-value determination for the Afrobarometer
hypothesis test. In the SPSS output, the observed value for
the t-statistic is -0.564. Because the value of t is negative, the
more extreme values of t are considered to be to the left of -
0.564. As shown in Figure 1, the area under the t curve and less
than -0.564 is .288. Because of the two-tailed test, however, the
area is doubled to account for the probability of the test statistic
taking on a value greater than +0.564. Hence the p-value for the
hypothesis test is .576.
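The doubling described above is easy to verify numerically. A minimal sketch, assuming Python with SciPy rather than SPSS:

    # Two-tailed p-value for the Afrobarometer t-test.
    from scipy import stats

    t_observed = -0.564
    df = 41
    lower_tail = stats.t.cdf(t_observed, df)   # area below -0.564, about .288
    p_two_tailed = 2 * lower_tail              # doubled for the two-tailed test
    print(round(lower_tail, 3), round(p_two_tailed, 3))  # 0.288 0.576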
Again, if alpha had been set equal to .05, the null hypothesis
would be retained (fail to reject) because .576 is greater than
.05. That is, the data do not provide sufficient evidence of a
difference in average levels of trust in government between the
populations of urban and rural citizens.
Keep in mind the following important points related to making a
statistical decision and interpreting your p-value:
· By definition, the p-value is the probability of obtaining a value for the test statistic as extreme or more extreme than the observed value if the null hypothesis is true.
· If the p-value is less than alpha, the null is rejected, and the result is said to be statistically significant.
· If the p-value is greater than alpha, then researchers would fail to reject the null hypothesis.
Statistically Significant Results
The final step in conducting a hypothesis test is to link the
statistical result to the real-world. That is, you need to examine
the practical significance or the meaningfulness of the
statistical result.
If the result of the hypothesis test is to retain the null —that is,
obtain a non-significant result—the researcher has clearly not
identified a meaningful effect. In most hypothesis tests,
retaining the null is not what the researcher is hoping to do.
On the other hand, if you reject the null hypothesis, you will
have a statistically significant result. You are, in essence,
saying that the result is so unlikely under the assumption of the
null being true that the null appears to be false. A false null
hypothesis does not mean, however, that the result is
scientifically or socially important. When a researcher finds a
statistically significant result, knowledge of the research area is
used to decide whether the result is important and meaningful.
Large effects are more often meaningful than small effects, but
there are times when small effects can be important.
Knowledge of the research area is key in making the decision.
Probably the most frequent concern with meaningless
statistically significant results has to do with sample size. With
extremely large sample sizes, hypothesis tests can result in
rejecting the null even though the effect is small and
unimportant from an applied perspective. To understand how
this works, let’s take another look at the Afrobarometer data
set. Participants in the survey were asked whether they agreed
or disagreed with the statement, “People must obey the law.”
Responses were made using a five-point Likert scale:
1 = strongly disagree
2 = disagree
3 = neither agree nor disagree
4 = agree
5 = strongly agree
Suppose a researcher had wanted to compare the urban and rural
populations and tested the null hypothesis
Ho : μurban = μrural using alpha equal to .05. Unlike the
example above that used a sample of 43 participants, the
following results are based on over 50,000 respondents. As
shown in the following table, the p-value ( Sig (2-tailed)) for
this test is .004.
                                        t        df     Sig. (2-tailed)   Mean Difference
Q48b. People must obey the law
Equal variances assumed              -2.892    50125          .004              -.029
Using APA style, the researcher could report that, on average,
the urban population agrees less with the statement than does
the rural population, t (50125) = -2.892, p = .004, d = .027,
95% CI [-.039, -.019].
· The statement says the t-test was conducted with 50,125 degrees of freedom, or 50,127 participants.
· The p-value of .004 is less than alpha, so the null hypothesis is rejected.
· The d statistic is Cohen's d, a common measure of effect size.
· The 95% confidence interval for the difference in population means does not contain zero, which is consistent with having rejected the null hypothesis.
There is no doubt the result is statistically significant, but how
meaningful is it? The d-statistic is quite useful because it
compares the difference in sample means to an average of the
standard deviations for the two groups. (The average standard
deviation is based on a weighted average of the two sample
variances.) According to Cohen, d = .2 is generally considered a
small effect, d = .5 a medium effect, and d = .8 a large effect.
The value of .027 is a little more than 10% of a small effect.
The statistically significant result that was obtained is
therefore not likely to be important.
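For a two-group comparison, Cohen's d can be recovered approximately from the t statistic and the group sizes, which is a quick way to sanity-check a reported effect size. The sketch below assumes Python and a near-equal split of the 50,127 participants, since the exact group sizes are not reported.

    # Approximate Cohen's d from t and the (assumed) group sizes.
    import math

    t = -2.892
    n1 = n2 = 50_127 // 2        # assumed near-equal split; exact sizes not reported
    d = abs(t) * math.sqrt(1 / n1 + 1 / n2)
    print(round(d, 3))           # about 0.026, close to the reported d = .027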
Statistical Power
Statistical power is the probability of rejecting a null hypothesis
if the null is false (i.e., the alternative is true). It is the degree
to which the researcher is able to detect an effect if there
actually is one. With low statistical power, a researcher may
struggle to detect an effect (to reject the null), even if an effect
actually occurs in the population.
Suppose you are planning an experiment involving stereotype
threat. Stereotype threat is defined as a tendency to behave in a
manner consistent with negative beliefs that others have about a
racial or gender group. For example, if some black test takers
are told that as a group, black test takers do not perform well on
math tests, performance among those black test-takers is worse
than for black test takers for whom the stereotype is not evoked.
One question you will need to answer is how many participants
should you include in your study to be confident in identifying
the effect? In other words, how many participants do you need
in order to have adequate statistical power in your study?
Factors That Affect Statistical Power
Understanding how several factors affect the statistical power
of a study will help you to understand and critique research
findings and will also lead to greater satisfaction with your own
research. When conducting your own research studies, you
should do a power analysis prior to collecting data to make sure
you have a good chance of demonstrating the effect you are
looking for.
There are three main factors that affect how much statistical
power you have in your study:
1. Alpha (i.e., the probability of a type I error)
2. Effect size (i.e., the difference between the population means for the experimental and control groups)
3. Sample size (i.e., n)
As a researcher, you have control over alpha and sample size.
The effect size, however, is not under your control and is
predetermined. What will be important to you is having an idea
about how great the effect may be. This Skill Builder is
concerned with how alpha, effect size, and sample size are
related to statistical power.
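Power analysis software lets you ask the planning question directly: given alpha and the effect size you expect, how many participants per group are needed to reach a target level of power? A minimal sketch assuming Python's statsmodels package and the conventional .80 power target discussed later in this Skill Builder:

    # Participants per group needed for 80% power, assuming a one-tailed
    # two-sample t-test, alpha = .05, and a small effect of d = .21.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.21, alpha=0.05,
                                              power=0.80, alternative='larger')
    print(round(n_per_group))    # roughly 280 participants per group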
A Review of Hypothesis Testing
Before discussing power, let’s review the basics of hypothesis
testing:
· The null hypothesis is the statement of no effect.
· The alternative hypothesis is a statement that an effect exists in the population.
· Obtaining a significant result means that you have rejected the null hypothesis and have concluded that it's likely that there is an effect in the population.
· A type I error happens when the null hypothesis is true but you reject it erroneously. This is referred to as a false positive.
· A type II error happens when the null hypothesis is false but you fail to reject it. This is referred to as a false negative.
Reviewing Type I and Type II Errors
Type I and type II errors and their probabilities are important
concepts when thinking about hypothesis testing. These error
events are called “conditional,” meaning that the events can
only occur under certain conditions.
The following is the language that is used to talk about these
conditional events:
· Alpha (α) = P(type I error) = P(Reject H0 | H0 is true), which is read as: the probability of a type I error equals the probability of rejecting the null hypothesis given the null is true.
· Beta (β) = P(type II error) = P(Retain H0 | HA is true), which is read as: the probability of a type II error equals the probability of retaining the null hypothesis given the alternative hypothesis is true.
Table 1 shows the possible outcomes for a hypothesis test.
Table 1: Possible Outcomes for a Hypothesis Test

                          True State of Nature
Decision            H0 is true            H0 is false
Retain H0           Correct decision      Type II error
Reject H0           Type I error          Correct decision
Power Analysis
Power analysis is the process of examining a planned test of the
null hypothesis to determine the probability of rejecting the
null when the alternative hypothesis is actually true.
Researchers typically want to get a sense of how much
statistical power they will have in their study before collecting
data. In order to do so, they usually conduct a power analysis.
Suppose you design a study, and a part of it is to demonstrate
stereotype threat involving females. Nguyen and Ryan (2008)
provide results that indicate the average Cohen’s d in previous
studies of gender-based stereotype threat for cognitive tests is
about .21. This means that over many studies, females who are
NOT made aware of a gender stereotype (NOT primed) score
about 0.2 standard deviations higher on cognitive tests than
females who are made aware of the stereotype (primed). To
demonstrate this effect in your study, you will test the
following null hypothesis:
H0 : μNOT primed − μprimed ≤ 0
If you reject the null, you will place your confidence in the
following alternative hypothesis:
HA : μNOT primed − μprimed > 0
Here, μNOT primed indicates the population mean for the "not
primed" condition and μprimed indicates the population mean for
the "primed" condition. The alternative hypothesis specifies that
the "not primed" condition will score higher than the "primed"
condition.
To test this null hypothesis, you would examine a test statistic
distribution and note the area in the upper tail of the
distribution equal to alpha. Suppose you plan to test this
hypothesis with a t-test with 50 participants in each condition
(primed or NOT primed).
The sampling distribution in Figure 1 shows what you should expect
for the values of the test statistic if the null hypothesis is true.
In order to reject the null hypothesis, the t value would need to
be greater than 1.66055.
Figure 1
Because the test statistic is a continuous variable, the curve
shows probability density, and probability is found by
determining the area under the curve.
The entire area under the curve, between −∞ and +∞, is 1.00.
To find the probability of a statistic taking on a value within a
certain range, you need to find the area under the curve within
the range. For example, there are tables that will tell you that
the area under the curve between t = 0 and t = +1 corresponds to
a probability of about .34. Most importantly, because alpha has
been set equal to .05, the area beyond 1.66 corresponds to a
probability of .05. Fortunately, statistical programs calculate
the areas for you, and you do not need to do the calculations
yourself.
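The 1.66055 cutoff quoted above is simply the 95th percentile of the t distribution; with 50 participants per group, the pooled-variance test has 50 + 50 - 2 = 98 degrees of freedom. A quick check, assuming Python with SciPy:

    # Critical value and tail areas for the one-tailed test with alpha = .05.
    from scipy import stats

    critical_t = stats.t.ppf(0.95, df=98)
    print(round(critical_t, 5))                   # approximately 1.66055, the cutoff quoted above
    print(round(stats.t.cdf(1, df=98) - 0.5, 2))  # area between t = 0 and t = +1, about .34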
Nevertheless, the essence of hypothesis testing is that if you
obtain a value of t greater than 1.66, you will say, “This is not a
very likely event if the null is true. Thus, the null hypothesis is
probably not true because the alternative hypothesis provides a
more likely explanation.” In making the decision to reject the
null, however, you recognize that if the null is, in fact, true, you
are making a type I error.
While alpha provides assurance that the researcher has a small
chance of making a type I error, you are also interested in what
will happen if the null hypothesis is false—the real world
expectation that is driving you to do the study.
Figure 2
Now, in Figure 2, switch your focus from the curve on the left
and attend to the curve on the right formed by the dashed line.
This curve is based on the alternative hypothesis (i.e., that the
unprimed group performs better than the primed group).
To construct this curve based on the alternative hypothesis, a
specific value for the difference in means had to be specified; in
this case, the value of d = .21, the overall gender effect that
Nguyen and Ryan (2008) found. Note, again, that the vertical
line with t = 1.66 separates the values of the test statistic that
lead to rejecting versus retaining the null hypothesis, and that
the line is based on the null hypothesis. The statistical power of
the test, (1-β), is the area under the curve with the dashed lines
and to the right of the vertical line at t = 1.66. The area
designated by beta (β), to the left of the vertical line,
corresponds to the probability of a type II error, retaining the
null if the null is actually false.
In this example, note that the area corresponding to power (1-β)
is less than the area corresponding to β. Hence, you can
conclude that the power is less than 0.5 because the sum of the
two areas is 1.0. Almost always, you would like statistical
power to be greater than beta for the important hypothesis tests
in your study. In this example, a plan to do an experiment with
50 participants in each group may be doomed. The statistical
power of the test (.27) is relatively low, and the risk of making
a type II error is relatively high. In other words, the statistical
power of the test, as currently constructed, limits your ability to
detect a gender effect of priming versus not priming if there is
one.
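The .27 power figure can be reproduced with a standard power routine. A minimal sketch assuming Python's statsmodels package:

    # Power of a one-tailed two-sample t-test with d = .21, alpha = .05,
    # and 50 participants per group.
    from statsmodels.stats.power import TTestIndPower

    power = TTestIndPower().solve_power(effect_size=0.21, nobs1=50,
                                        alpha=0.05, alternative='larger')
    print(round(power, 2))  # about 0.27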
Consider the following scenario when answering the question
below.
You are planning a study of stereotype threat and are concerned
you may not be able to detect a significant result, even though
you believe your experimental procedures should induce the
stereotype threat effect.
Hint: A type I error happens when the null hypothesis is true
and you reject it.
Which of the following errors are you concerned about?
Type I error
Type II error
The Relationship Between Power and Sample Size
Prior discussions have focused on testing hypotheses about
population means, but you can also do hypothesis tests
involving population proportions. In general, larger sample
sizes give you more information to pin down the true nature of
the population. You can, therefore, expect the sample mean
and sample proportion obtained from a larger sample to be
closer to the population mean and proportion, respectively.
As a result, for the same level of confidence, you can report a
smaller margin of error, and get a narrower confidence interval.
In other words, larger sample sizes increase how much you trust
your sample results. In the two scenarios below, you will see
that a larger sample size results in a greater ability to reject the
null when an effect actually exists in the population.
Scenario: Examining Marijuana Use
Imagine you are a researcher examining marijuana use at a
certain liberal arts college and read through the scenario below.
Step 1
You believe that marijuana use at the college is greater than the
national average, for which large-scale studies have shown that
about 15.7% of college students use marijuana (reported by the
Harvard School of Public Health). In a simple random sample of
100 students from the college, 19 admit to marijuana use, so the
sample proportion is .19. Based on this belief, you test
H0: p = .157 against HA: p > .157 and obtain a test statistic of
z = .91 and a p-value of .182.
· Note that p here means the population proportion and p̂ means
the sample proportion; p-value continues to have its usual
meaning.
Because the p-value of .182 is greater than .05, the customary
alpha level, the data do not provide enough evidence to conclude
that the proportion of marijuana users at the college is higher
than the proportion among all U.S. college students, which is
.157.
Step 2
Let's make some small changes to the above problem. Suppose that
in a simple random sample of 400 students from the college, 76
admitted to marijuana use. Do the data provide enough evidence to
conclude that the proportion of marijuana users among the
students in the college (p) is higher than the national
proportion, which is .157?
Step 3
You now have a larger sample (400 instead of 100), and also the
number of marijuana users is 76 instead of 19. The question of
interest did not change, so if you carry out the test in this
case, you are testing the same hypotheses as before:
H0: p = .157 versus HA: p > .157.
Step 4
You select a random sample of size 400 and find that 76 are
marijuana users, so the sample proportion is p̂ = 76/400 = .19.
This is the same sample proportion as in the original problem, so
it seems that the data give the same evidence.
Step 5
However, when you calculate the test statistic,
z = (.19 − .157) / √(.157 × .843 / 400) ≈ 1.81, you see that this
is not the case.
Even though the sample proportion is the same (.19), because
here it is based on a larger sample (400 instead of 100), it is
1.81 standard deviations above the null value of .157 (as
opposed to .91 standard deviations in the original problem). The
sampling distribution for the sample proportion has a smaller
standard error because of the larger sample size.
Step 6
The p-value here is .035, as opposed to .182 in the original
problem. In other words, when Ho is true (i.e., if p = .157 at the
certain college), it is quite unlikely (probability of .035) to get
a sample proportion of .19 or higher based on a sample of size
400. When the sample size is 100, a sample proportion of .19 or
higher is considerably more likely (probability .182).
The results here are important. With n = 400, the data provide
enough evidence to reject Ho and conclude that the proportion
of marijuana users at the college is higher than among all U.S.
students. With n = 100, however, the evidence is insufficient to
reject the null.
You can see that results that are based on a larger sample carry
more weight. A sample proportion of .19 based on a sample of
size of 100 was not enough evidence that the proportion of
marijuana users in the college is higher than .157. Recall that
this conclusion (not having enough evidence to reject the null
hypothesis) doesn't mean the null hypothesis is necessarily true;
it only means that the particular study did not yield sufficient
evidence to reject the null. It might be that the sample size was
simply too small to detect a statistically significant difference,
and a type II error was made.
To summarize, you saw that when the sample proportion of .19
is obtained from a sample of size 400, it carries much more
weight, and in particular, provides enough evidence that the
proportion of marijuana users in the college is higher than .157
(the national figure). In this case, the sample size of
400 was large enough to detect a statistically significant
difference.
The following graphs show the power of the two tests if the
population proportion p at the college is actually .19.
Figure 10
Figure 11
Figure 12
Finally, Figure 12 shows how sample size affects the test for
proportions concerning marijuana use at the liberal arts college.
The graph is based on a hypothesis test with alpha = .05, the
proportion for the null hypothesis equal to .157, and the
population proportion for the liberal arts college = .19.
In general, whether you are testing hypotheses about
proportions, means, or other parameters, the larger the sample
size, the greater the statistical power. Because of your interest
in rejecting the null, you need to pay attention to how large
your sample size will be prior to collecting data.
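You can also compute the power of each of these proportion tests directly, that is, the probability of rejecting H0: p = .157 when the proportion at the college is really .19. The normal-approximation calculation below is a sketch assuming Python with SciPy; the exact power values are not stated in the text, so treat the printed numbers as illustrative.

    # Power of the one-tailed proportion test when the true proportion is .19.
    import math
    from scipy import stats

    p0, p_true, alpha = 0.157, 0.19, 0.05
    z_crit = stats.norm.ppf(1 - alpha)                 # about 1.645
    for n in (100, 400):
        se0 = math.sqrt(p0 * (1 - p0) / n)             # SE under the null
        se1 = math.sqrt(p_true * (1 - p_true) / n)     # SE under the true proportion
        cutoff = p0 + z_crit * se0                     # smallest sample proportion that rejects H0
        power = 1 - stats.norm.cdf((cutoff - p_true) / se1)
        print(n, round(power, 2))                      # roughly .25 for n = 100 and .56 for n = 400

As with the t-test examples, the larger sample gives a much better chance of detecting the effect when it really exists.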
Week 5
Scenarios
1. The p-value was slightly above the conventional threshold but
was described as "rapidly approaching significance" (i.e., p = .06).
An independent samples t test was used to determine whether
student satisfaction
levels in a quantitative reasoning course differed between the
traditional classroom
and on-line environments. The samples consisted of students in
four face-to-face
classes at a traditional state university (n = 65) and four online
classes offered at
the same university (n = 69). Students reported their level of
satisfaction on a five-
point scale, with higher values indicating higher levels of
satisfaction. Since the
study was exploratory in nature, levels of significance were
relaxed to the .10 level.
The test was significant t(132) = 1.8, p = .074, wherein students
in the face-to-face
class reported lower levels of satisfaction (M = 3.39, SD = 1.8)
than did those in the
online sections (M = 3.89, SD = 1.4). We therefore conclude
that on average,
students in online quantitative reasoning classes have higher
levels of satisfaction.
The results of this study are significant because they provide
educators with
evidence of what medium works better in producing
quantitatively knowledgeable
practitioners.
2. A results report that does not find any effect and also has
small sample size
(possibly no effect detected due to lack of power).
A one-way analysis of variance was used to test whether a
relationship exists
between educational attainment and race. The dependent
variable of education
was measured as number of years of education completed. The
race factor had
three attributes of European American (n = 36), African
American (n = 23) and
Hispanic (n = 18). Descriptive statistics indicate that on
average, European
Americans have higher levels of education (M = 16.4, SD =
4.6), with African
Americans slightly trailing (M = 15.5, SD = 6.8) and Hispanics
having on average
lower levels of educational attainment (M = 13.3, SD = 6.1).
The ANOVA was not
significant F (2,74) = 1.789, p = .175, indicating there are no
differences in
educational attainment across these three races in the
population. The results of
this study are significant because they shed light on the current
social conversation
about inequality.
3. Statistical significance is found in a study, but the effect in
reality is very small (i.e.,
there was a very minor difference in attitude between men and
women). Were the
results meaningful?
An independent samples t test was conducted to determine
whether differences
exist between men and women on cultural competency scores.
The samples
consisted of 663 women and 650 men taken from a convenience
sample of public,
private, and non-profit organizations. Each participant was
administered an
instrument that measured his or her current levels of cultural
competency. The
cultural competency score ranges from 0 to 10, with higher
scores indicating higher
levels of cultural competency. The descriptive statistics indicate
women have
higher levels of cultural competency (M = 9.2, SD = 3.2) than
men (M = 8.9, SD =
2.1). The results were significant t (1311) = 2.0, p <.05,
indicating that women are
more culturally competent than are men. These results tell us
that gender-specific
interventions targeted toward men may assist in bolstering
cultural competency.
4. A study has results that seem fine, but there is no clear
association to social
change. What is missing?
A correlation test was conducted to determine whether a
relationship exists
between level of income and job satisfaction. The sample
consisted of 432
employees equally represented across public, private, and non-
profit sectors. The
results of the test demonstrate a strong positive correlation
between the two
variables, r =.87, p < .01, showing that as level of income
increases, job
satisfaction increases as well.
Assignment: Evaluating Significance of Findings
Part of your task as a scholar-practitioner is to act as a critical
consumer of research and ask informed questions of published
material. Sometimes, claims are made that do not match the
results of the analysis. Unfortunately, this is why statistics is
sometimes unfairly associated with telling lies. These
misalignments might not be solely attributable to statistical
nonsense, but also “user error.” One of the greatest areas of user
error is within the practice of hypothesis testing and
interpreting statistical significance. As you continue to consume
research, be sure to read everything with a critical eye and call
out statements that do not match the results.
For this Assignment, you will examine statistical significance
and meaningfulness based on sample statements.
To prepare for this Assignment:
· Review the Week 5 Scenarios found in this week’s Learning
Resources and select two of the four scenarios for this
Assignment.
· For additional support, review the Skill Builder: Evaluating P
Values and the Skill Builder: Statistical Power, which you can
find by navigating back to your Blackboard Course Home Page.
From there, locate the Skill Builder link in the left navigation
pane.
For this Assignment:
Critically evaluate the two scenarios you selected based upon
the following points:
· Critically evaluate the sample size.
· Critically evaluate the statements for meaningfulness.
· Critically evaluate the statements for statistical significance.
· Based on your evaluation, provide an explanation of the
implications for social change.
Use proper APA format, citations, and referencing.
American Statistical Association. (2016). The ASA's statement on p-values: Context, process, and purpose. https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf
Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G.
(2020). Social statistics for a diverse society (9th ed.).
Thousand Oaks, CA: Sage Publications.
· Chapter 8, “Testing Hypothesis: Assumptions of Statistical
Hypothesis Testing” (pp. 241-242)
Wagner, W. E., III. (2020). Using IBM® SPSS® statistics for
research methods and social science statistics (7th ed.).
Thousand Oaks, CA: Sage Publications.
· Chapter 6, “Testing Hypotheses Using Means and Cross-
Tabulation”
https://content.waldenu.edu/content/dam/laureate/laureate-academics/wal/xx-rsch/rsch-8210/readings/USW1_RSCH_8210_Week05_Warner_chapter03.pdf
Walden University, LLC. (Producer). (2016f). Meaningfulness
vs. statistical significance [Video file]. Baltimore, MD: Author.
Note: The approximate length of this media piece is 4 minutes.
In this media program, Dr. Matt Jones discusses the differences
in meaningfulness and statistical significance. Focus on how
this information will inform your Discussion and Assignment
for this week.
Skill Builder: Evaluating P Values
Steps of Hypothesis Testing
To interpret p-values, let's review the key steps in hypothesis
testing.
Step 1
State the null and alternative hypotheses
Recall that hypotheses are statements about population
parameters. For the Trust in Government example from the
Afrobarometer data set, the null and alternative hypotheses are
H0: μurban = μrural and HA: μurban ≠ μrural.
The Greek letter, µ, indicates a population mean, and the
subscripts indicate levels of the independent variable (“urban”
and “rural”). Here the null is saying that the mean for the urban
population on the Trust In Government variable is the same as
the mean for the rural population. The alternative hypothesis
states that these means are not the same.
Step 2
Set alpha, the probability of a type I error
Frequently, the value of alpha is set equal to 0.05, although
researchers are free to use other values. If using an alpha of .05,
then researchers are specifying that there is a 5% chance that
they will reject the null when, in fact, it should not be rejected.
Setting alpha at .05 is popular because there is relatively
minimal risk of making a type I error, and alpha is not so small
that researchers greatly increase their risk of not rejecting the
null when they actually should (a type II error). So in setting
alpha, researchers have to be aware of both the risk of rejecting
the null erroneously and of not rejecting it when they actually
should. For our Afrobarometer example, we will set alpha at
.05.
Step 3
Decide on a test statistic
Because of a desire to compare two groups (rural and urban),
a t-test for two independent samples is being used.
Step 4
Collect the data and examine the model assumptions
Before calculating the value for your test statistic, be sure you
have checked assumptions, like homogeneity of variance and
the absence of outliers.
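One common way to check the homogeneity of variance assumption is Levene's test, which SPSS prints alongside the independent samples t-test. A minimal sketch assuming Python with SciPy and two hypothetical score lists (the values below are made up for illustration):

    # Checking equality of variances before running the t-test.
    from scipy import stats

    urban_scores = [4, 7, 12, 2, 9, 6, 11, 5, 8, 10]   # hypothetical data
    rural_scores = [3, 14, 8, 6, 12, 7, 5, 13, 9, 4]   # hypothetical data

    stat, p = stats.levene(urban_scores, rural_scores)
    print(round(stat, 3), round(p, 3))
    # A small p-value suggests unequal variances; in that case the
    # unequal-variances (Welch) t-test is the safer choice.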
Step 5
Calculate the observed value of the test statistic
Once the data have been collected, the observed value of the
test statistic will be used to make a statistical decision. In the
Afrobarometer example, the observed value of the test statistic
is -.564, sometimes written as tobserved(41) = −.564, where 41 is
the number of degrees of freedom associated with the test.
Step 6
Make a statistical decision using the observed value
This decision requires examining the distribution of the test
statistic under the assumption the null hypothesis is true.
Practically, the area in the tail of the distribution beyond the
observed value of the test statistic, called the p-value, needs to
be determined (see the figure above). Fortunately, computer
programs can do the calculation of the area quickly and easily.
If the probability is less than alpha (e.g., .05), we will reject the
null hypothesis. Thus, if you set alpha equal to .05 and the p-
value for your test statistic is any value less than .05, you will
reject the null hypothesis. Otherwise, retain the null.
Step 7
Make a real-world decision
The statistical decision is focused on the abstract hypothesis
test. The final step is to examine the implications of the
statistical decision in the real world. You will need to consider
whether your results are practically significant. It turns out that
not all statistically significant results are important in the real
world. We will discuss more about this later in the Skill
Builder.
Skill Builder: Statistical Power
Power Analysis
As the researcher, you have control of alpha, and you will set
alpha when you are planning your study. Continuing with the
stereotype threat example above, Figure 3 below shows what would
happen to power if you change alpha, the probability of a type I
error, to .15.
Figure 2
Compare the curves in Figure 3 to the ones above in Figure 2; in
that figure, alpha (α) was equal to .05. Notice that β becomes
smaller, and power, (1-β), becomes larger. If you change α to
.01, a relatively small value for the probability of a type I error,
beta (β) becomes larger, and power becomes less. See Figure 4
below.
Figure 3
Figure 4
· In general, making alpha (α) smaller results in a decrease in the power of the statistical test, and making alpha larger results in greater power (the sketch after this list illustrates the pattern). This is because if you set a more stringent alpha (e.g., .01 instead of .05), it becomes more difficult to reject the null hypothesis. While .05 is a typical value for α, the decision of which value to use for α is up to the researcher. Letting alpha (α) equal .05 is certainly common practice.
· Many journal editors expect alpha (α) to equal .05. There are other times, however, when the researcher may wish to use a different value for alpha (α) depending on the severity of the consequences of making a type I error. For example, if you are studying whether or not a drug has serious side effects, with the null specifying that there are no serious side effects, you may want a more stringent alpha to lower your risk of saying that there aren't side effects when there actually are; you may opt for a .01 alpha instead of a .05 alpha.
Power and Effect Size
A second factor that is related to the statistical power of a test
is the effect size. There are several measures of effect size.
With a comparison of two populations, Cohen’s d is often used.
The value of d is the difference in population means between
two groups in standard deviation units. According to Cohen’s
rule of thumb, a value of d = .2 is considered a small effect, d =
.5 is considered a medium-sized effect, and d = .8 is considered
a large effect.
Let’s revisit the earlier example about planning a study to
demonstrate race-based stereotype threat. Nguyen and Ryan
(2008) note that overall race-based stereotype threat studies
have resulted in an average d equal to about .32. Figure 5 below
shows what you can expect if you induced a general racial
stereotype threat in a rather typical way so that in the
population d = .32, there are 50 participants in each group, and
alpha = .05. Note that power has increased noticeably compared
to the study examined in Figure 2. This is due to the effect size
( d = .32) in this figure being larger than the effect size ( d =
.21) in Figure 2.
Figure 5
There are instances in which stereotype effects as large as d =
.64 have been identified in the samples being studied. If the
population d is .64, the hypothesis test with alpha = .05 and 50
participants in each group will result in power equal to .93 as
shown in Figure 6. This is a high value for statistical power,
meaning that the researchers are very likely to detect an effect
if d = .64 in the population.
Figure 6
Most researchers prefer to have the estimate of power be at least
.80 before they are willing to conduct a study. So planning to do
a study with 50 participants in each group may be a bad
decision if the effect size in the population is small or
moderate, as it was above in Figures 2 and 5. On the other hand,
with a large effect (e.g., d = .64), a sample of 50 participants in
each condition provides more than sufficient statistical power
for most researchers.
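The same routine shows how strongly the effect size drives power for a fixed design (50 participants per group, alpha = .05, one-tailed test). Again, this is a sketch assuming Python's statsmodels package:

    # Power at the effect sizes discussed above.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for d in (0.21, 0.32, 0.64):
        power = analysis.solve_power(effect_size=d, nobs1=50,
                                     alpha=0.05, alternative='larger')
        print(d, round(power, 2))   # power climbs steeply as d grows, from about .27 to above .9

With numbers like these in hand before data collection, you can judge whether a plan of 50 participants per group is adequate for the effect you realistically expect.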
The Relationship Between Power and Sample Size
Prior discussions have focused on testing hypotheses about
population means, but you can also do hypothesis tests
involving population proportions. In general, larger sample
sizes give you more information to pin down the true nature of
the population. You can, therefore, expect the sample mean
and sample proportion obtained from a larger sample to be
closer to the population mean and proportion, respectively.
As a result, for the same level of confidence, you can report a
smaller margin of error, and get a narrower confidence interval.
In other words, larger sample sizes increase how much you trust
your sample results. In the two scenarios below, you will see
that a larger sample size results in a greater ability to reject the
null when an effect actually exists in the population.
Scenario: Examining Marijuana Use
Imagine you are a researcher examining marijuana use at a
certain liberal arts college and read through the scenario below.
Step 1
You believe that marijuana use at the college is greater than the
national average, for which large-scale studies have shown that
about 15.7% of college students use marijuana (reported by the
Harvard School of Public Health). Based on this belief, you
perform the hypothesis test shown in Figure 9 below.
· Note that p in this figure means population proportion
and pˆ means sample proportion. On the other hand, p-value
continues to have the same meaning as defined in the glossary.
Because the p-value is greater than .05, the customary alpha
level, the data do not provide enough evidence that the
proportion of marijuana users at the college is higher than the
proportion among all U.S. college students, which is .157.
Step 2
Let’s make some small changes to the above problem. Suppose
that in a simple random sample of 400 students from the
college, 76 admitted to marijuana use as seen in Figure 8 below.
Do the data provide enough evidence to conclude that the
proportion of marijuana users among the students in the college
(p) is higher than the national proportion, which is .157?
Step 3
You now have a larger sample (400 instead of 100), and also the
number of marijuana users is 76 instead of 19. The question of
interest did not change, so if you carry out the test in this case,
you are testing the same hypotheses seen below.
Step 4
You select a random sample of size 400 and find that 76 are
marijuana users, which gives the sample proportion shown in the
formula below (76/400 = .19). This is the same sample proportion
as in the original problem, so it seems that the data give the
same evidence.
Step 5
However, when you calculate the test statistic, you see that this
is actually not the case, as the formula below shows.
Even though the sample proportion is the same (.19), because it is
now based on a larger sample (400 instead of 100), it falls 1.81
standard errors above the null value of .157 (as opposed to .91
standard errors in the original problem). The sampling distribution
of the sample proportion has a smaller standard error because of
the larger sample size.
Step 6
The p-value here is .035, as opposed to .182 in the original
problem. In other words, when Ho is true (i.e., if p = .157 at the
college), it is quite unlikely (probability .035) to get a sample
proportion of .19 or higher from a sample of size 400. When the
sample size is 100, a sample proportion of .19 or higher is much
more likely (probability .182).
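The z-statistics and p-values quoted in Steps 5 and 6 can be reproduced with
a few lines of code. The following is a minimal sketch in Python (added here
as an illustration, not taken from the original figures) of the one-sided,
one-proportion z-test against the null value of .157:

# One-proportion z-test of H0: p = .157 versus HA: p > .157,
# for the original sample (19 of 100) and the larger sample (76 of 400).
from math import sqrt
from scipy.stats import norm

p0 = 0.157
for users, n in ((19, 100), (76, 400)):
    p_hat = users / n                      # .19 in both cases
    se_null = sqrt(p0 * (1 - p0) / n)      # standard error under the null
    z = (p_hat - p0) / se_null
    p_value = 1 - norm.cdf(z)              # one-sided (upper-tail) p-value
    print(f"n = {n}: z = {z:.2f}, p-value = {p_value:.3f}")
# n = 100 gives z of about .91 with a p-value near .182;
# n = 400 gives z of about 1.81 with a p-value near .035.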
The results here are important. With n = 400, the data provide
enough evidence to reject Ho and conclude that the proportion
of marijuana users at the college is higher than among all U.S.
college students. With n = 100, however, the evidence is
insufficient to reject the null. Figure 9 summarizes these findings.
You can see that results based on a larger sample carry more
weight. A sample proportion of .19 based on a sample of size 100
was not enough evidence that the proportion of marijuana users at
the college is higher than .157. Recall that this conclusion (not
having enough evidence to reject the null hypothesis) doesn't mean
the null hypothesis is necessarily true; it only means that this
particular study did not yield sufficient evidence to reject the
null. It might be that the sample size was simply too small to
detect a statistically significant difference, and a type II error
was made.
To summarize, you saw that when the sample proportion of .19 is
obtained from a sample of size 400, it carries much more weight
and, in particular, provides enough evidence that the proportion
of marijuana users at the college is higher than .157 (the
national figure). In this case, the sample size of 400 was large
enough to detect a statistically significant difference.
The following graphs show the power of the two tests if the
population proportion p at the college is actually .19. Use the
< and > icons to navigate between slides.
Figure 10
Figure 11
Figure 12
Finally, Figure 12 shows how sample size affects the power of the
test for proportions concerning marijuana use at the liberal arts
college. The graph is based on a hypothesis test with alpha = .05,
the proportion under the null hypothesis equal to .157, and the
population proportion for the liberal arts college equal to .19.
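Those power values can also be approximated directly with the normal
distribution. The sketch below is an illustration of the calculation rather
than the source of Figures 10 through 12, so its output may differ slightly
from the plotted values; it computes approximate power at the two sample
sizes from the scenario and could be looped over a range of sample sizes to
trace a curve like the one in Figure 12.

# Approximate power of the one-sided proportion test (alpha = .05,
# null value .157) when the true proportion at the college is .19.
from math import sqrt
from scipy.stats import norm

p0, p_true, alpha = 0.157, 0.19, 0.05
z_crit = norm.ppf(1 - alpha)                   # about 1.645, one-sided

def power(n):
    se_null = sqrt(p0 * (1 - p0) / n)          # SE if the null is true
    se_true = sqrt(p_true * (1 - p_true) / n)  # SE at the true proportion
    cutoff = p0 + z_crit * se_null             # smallest p-hat that rejects Ho
    return 1 - norm.cdf((cutoff - p_true) / se_true)

for n in (100, 400):
    print(f"n = {n}: approximate power = {power(n):.2f}")
# The approximation gives power of roughly .25 at n = 100 and .56 at n = 400,
# so the larger sample is far more likely to detect a true proportion of .19.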
In general, whether you are testing hypotheses about proportions,
means, or other parameters, the larger the sample size, the
greater the statistical power. Because you want to be able to
reject the null when it is false, you need to pay attention to how
large your sample size will be prior to collecting data.


Hypothesis TestingIn doing research, one of the most common acti

Week 5 Scenarios

1. The p-value was slightly above the conventional threshold but was
described as “rapidly approaching significance” (i.e., p = .06).
An independent samples t test was used to determine whether student
satisfaction levels in a quantitative reasoning course differed between the
traditional classroom and online environments. The samples consisted of
students in four face-to-face classes at a traditional state university
(n = 65) and four online classes offered at the same university (n = 69).
Students reported their level of satisfaction on a five-point scale, with
higher values indicating higher levels of satisfaction. Since the study was
exploratory in nature, levels of significance were relaxed to the .10 level.
The test was significant, t(132) = 1.8, p = .074, wherein students in the
face-to-face classes reported lower levels of satisfaction (M = 3.39,
SD = 1.8) than did those in the online sections (M = 3.89, SD = 1.4). We
therefore conclude that, on average, students in online quantitative
reasoning classes have higher levels of satisfaction. The results of this
study are significant because they provide educators with evidence of what
medium works better in producing quantitatively knowledgeable practitioners.

2. A results report that does not find any effect and also has a small
sample size (possibly no effect detected due to lack of power).
A one-way analysis of variance was used to test whether a relationship
exists between educational attainment and race. The dependent variable of
education was measured as the number of years of education completed. The
race factor had three attributes: European American (n = 36), African
American (n = 23), and Hispanic (n = 18). Descriptive statistics indicate
that, on average, European Americans have higher levels of education
(M = 16.4, SD = 4.6), with African Americans slightly trailing (M = 15.5,
SD = 6.8) and Hispanics having, on average, lower levels of educational
attainment (M = 13.3, SD = 6.1). The ANOVA was not significant,
F(2, 74) = 1.789, p = .175, indicating there are no differences in
educational attainment across these three races in the population. The
results of this study are significant because they shed light on the
current social conversation about inequality.

3. Statistical significance is found in a study, but the effect in reality
is very small (i.e., there was a very minor difference in attitude between
men and women). Were the results meaningful?
An independent samples t test was conducted to determine whether differences
exist between men and women on cultural competency scores. The samples
consisted of 663 women and 650 men taken from a convenience sample of
public, private, and non-profit organizations. Each participant was
administered an instrument that measured his or her current level of
cultural competency. The cultural competency score ranges from 0 to 10,
with higher scores indicating higher levels of cultural competency. The
descriptive statistics indicate women have higher levels of cultural
competency (M = 9.2, SD = 3.2) than men (M = 8.9, SD = 2.1). The results
were significant, t(1311) = 2.0, p < .05, indicating that women are more
culturally competent than are men. These results tell us that
gender-specific interventions targeted toward men may assist in bolstering
cultural competency.

4. A study has results that seem fine, but there is no clear association to
social change. What is missing?
A correlation test was conducted to determine whether a relationship exists
between level of income and job satisfaction. The sample consisted of 432
employees equally represented across public, private, and non-profit
sectors. The results of the test demonstrate a strong positive correlation
between the two variables, r = .87, p < .01, showing that as level of income
increases, job satisfaction increases as well.

Assignment: Evaluating Significance of Findings
Part of your task as a scholar-practitioner is to act as a critical consumer
of research and ask informed questions of published material. Sometimes,
claims are made that do not match the results of the analysis.
Unfortunately, this is why statistics is sometimes unfairly associated with
telling lies. These misalignments might not be solely attributable to
statistical nonsense, but also to “user error.” One of the greatest areas of
user error is within the practice of hypothesis testing and interpreting
statistical significance. As you continue to consume research, be sure to
read everything with a critical eye and call out statements that do not
match the results.
For this Assignment, you will examine statistical significance and
meaningfulness based on sample statements.
To prepare for this Assignment:
· Review the Week 5 Scenarios found in this week's Learning Resources and
select two of the four scenarios for this Assignment.
· For additional support, review the Skill Builder: Evaluating P Values and
the Skill Builder: Statistical Power, which you can find by navigating back
to your Blackboard Course Home Page. From there, locate the Skill Builder
link in the left navigation pane.
For this Assignment, critically evaluate the two scenarios you selected
based upon the following points:
· Critically evaluate the sample size.
· Critically evaluate the statements for meaningfulness.
· Critically evaluate the statements for statistical significance.
· Based on your evaluation, provide an explanation of the implications for
social change.
Use proper APA format, citations, and referencing.

Learning Resources
https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf
Frankfort-Nachmias, C., Leon-Guerrero, A., & Davis, G. (2020). Social
statistics for a diverse society (9th ed.). Thousand Oaks, CA: Sage
Publications.
· Chapter 8, “Testing Hypothesis: Assumptions of Statistical Hypothesis
Testing” (pp. 241-242)
Wagner, III, W. E. (2020). Using IBM® SPSS® statistics for research methods
and social science statistics (7th ed.). Thousand Oaks, CA: Sage
Publications.
· Chapter 6, “Testing Hypotheses Using Means and Cross-Tabulation”
https://content.waldenu.edu/content/dam/laureate/laureate-academics/wal/xx-rsch/rsch-8210/readings/USW1_RSCH_8210_Week05_Warner_chapter03.pdf
Walden University, LLC. (Producer). (2016f). Meaningfulness vs. statistical
significance [Video file]. Baltimore, MD: Author.
Note: The approximate length of this media piece is 4 minutes. In this media
program, Dr. Matt Jones discusses the differences in meaningfulness and
statistical significance. Focus on how this information will inform your
Discussion and Assignment for this week.
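One way to act on the Assignment's prompt to evaluate meaningfulness is to
convert a scenario's reported means and standard deviations into an effect
size. The short sketch below (an illustration added here, not part of the
scenarios) computes Cohen's d for scenario 3 using a pooled standard
deviation; the benchmarks of .2, .5, and .8 for small, medium, and large
effects are the Cohen conventions discussed earlier in the Skill Builder.

# Cohen's d for scenario 3: women (M = 9.2, SD = 3.2, n = 663)
# versus men (M = 8.9, SD = 2.1, n = 650), using a pooled standard deviation.
from math import sqrt

m_women, sd_women, n_women = 9.2, 3.2, 663
m_men, sd_men, n_men = 8.9, 2.1, 650

sd_pooled = sqrt(((n_women - 1) * sd_women**2 + (n_men - 1) * sd_men**2)
                 / (n_women + n_men - 2))
d = (m_women - m_men) / sd_pooled
print(f"Cohen's d = {d:.2f}")  # about 0.11, well below the .2 "small" benchmark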
  • 24. Step 2: Set alpha, the probability of a type I error. Frequently, the value of alpha is set equal to .05, although researchers are free to use other values. If using an alpha of .05, researchers are specifying that there is a 5% chance that they will reject the null when, in fact, it should not be rejected. Setting alpha at .05 is popular because there is relatively minimal risk of making a type I error, and alpha is not so small that researchers greatly increase their risk of not rejecting the null when they actually should (a type II error). So in setting alpha, researchers have to be aware of both the risk of rejecting the null erroneously and of not rejecting it when they actually should. For our Afrobarometer example, we will set alpha at .05.
  • 25. Step 3: Decide on a test statistic. Because two groups (rural and urban) are being compared, a t-test for two independent samples is used.
  • 26. Step 4: Collect the data and examine the model assumptions. Before calculating the value of your test statistic, be sure you have checked assumptions such as homogeneity of variance and the absence of outliers.
  • 27. Step 5: Calculate the observed value of the test statistic. Once the data have been collected, the observed value of the test statistic is used to make a statistical decision. In the Afrobarometer example, the observed value of the test statistic is -.564, sometimes written as t(41) = -.564, where 41 is the number of degrees of freedom associated with the test.
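The observed value in Step 5 is typically produced by statistical software. As a rough illustration only, the minimal Python sketch below assumes SciPy is available and uses made-up trust scores, not the actual Afrobarometer sample, to show how an independent samples t-statistic and its two-tailed p-value are obtained.

    # Minimal sketch (assumes SciPy). The scores are hypothetical 0-15 trust
    # ratings, NOT the actual Afrobarometer data.
    from scipy import stats

    urban = [2, 5, 7, 9, 11, 4, 6, 8, 10, 3]    # hypothetical urban respondents
    rural = [6, 9, 12, 5, 8, 10, 7, 11, 4, 13]  # hypothetical rural respondents

    t_obs, p_two_tailed = stats.ttest_ind(urban, rural)  # pooled-variance t-test
    print(f"t = {t_obs:.3f}, two-tailed p = {p_two_tailed:.3f}")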
  • 28. Step 6: Make a statistical decision using the observed value. This decision requires examining the distribution of the test statistic under the assumption that the null hypothesis is true. Practically, you need to determine the area in the tail of the distribution beyond the observed value of the test statistic, called the p-value (see the figure above). Fortunately, computer programs can calculate this area quickly and easily. If the p-value is less than alpha (e.g., .05), reject the null hypothesis. Thus, if you set alpha equal to .05 and the p-value for your test statistic is any value less than .05, you reject the null hypothesis; otherwise, retain the null. Step 7: Make a real-world decision. The statistical decision is focused on the abstract hypothesis test. The final step is to examine the implications of the statistical decision in the real world. You will need to consider whether your results are practically significant; not all statistically significant results are important in the real world. We will discuss more about this later in the Skill Builder.
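To make the Step 6 decision rule concrete, here is a minimal sketch (again assuming SciPy is available) that reproduces the reported two-tailed p-value of .576 from t(41) = -.564 and compares it to alpha.

    # Minimal sketch (assumes SciPy): the Step 6 decision for the reported result.
    from scipy import stats

    t_obs, df, alpha = -0.564, 41, 0.05
    p_value = 2 * stats.t.sf(abs(t_obs), df)  # area in both tails beyond |t|
    print(round(p_value, 3))                  # about 0.576, as reported
    print("reject H0" if p_value < alpha else "retain H0")  # retain H0 here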
  • 29. Skill Builder: Statistical Power. Statistical power is the probability of rejecting the null hypothesis when the null is false (i.e., the alternative is true). It is the degree to which the researcher is able to detect an effect if there actually is one. With low statistical power, a researcher may struggle to detect an effect (to reject the null), even if an effect actually occurs in the population. Suppose you are planning an experiment involving stereotype threat. Stereotype threat is defined as a tendency to behave in a manner consistent with negative beliefs that others hold about a racial or gender group. For example, if some black test takers are told that, as a group, black test takers do not perform well on math tests, performance among those test takers is worse than among black test takers for whom the stereotype is not evoked. One question you will need to answer is how many participants to include in your study to be confident of identifying the effect. In other words, how many participants do you need in order to have adequate statistical power in your study? Factors That Affect Statistical Power. Understanding how several factors affect the statistical power of a study will help you understand and critique research findings and will also lead to greater satisfaction with your own research. When conducting your own research studies, you should do a power analysis prior to collecting data to make sure you have a good chance of demonstrating the effect you are looking for. There are three main factors that affect how much statistical power you have in your study: · 1. Alpha (i.e., the probability of a type I error) · 2. Effect size (i.e., the difference between the population means for the experimental and control groups) · 3. Sample size (i.e., n). As a researcher, you have control over alpha and sample size. The effect size, however, is not under your control and is predetermined; what will be important to you is having an idea of how large the effect may be. This Skill Builder is concerned with how alpha, effect size, and sample size are
  • 30. related to statistical power. A Review of Hypothesis Testing. Before discussing power, let's review the basics of hypothesis testing: · The null hypothesis is the statement of no effect. · The alternative hypothesis is a statement that an effect exists in the population. · Obtaining a significant result means that you have rejected the null hypothesis and have concluded that it is likely there is an effect in the population. · A type I error happens when the null hypothesis is true but you reject it erroneously. This is referred to as a false positive. · A type II error happens when the null hypothesis is false but you fail to reject it. This is referred to as a false negative. Reviewing Type I and Type II Errors. Type I and type II errors and their probabilities are important concepts when thinking about hypothesis testing. These error events are called "conditional," meaning that the events can only occur under certain conditions. The following is the language that is used to talk about these conditional events: · Alpha (α) = P(type I error) = P(Reject H0 | H0 is true), which is read as the probability of a type I error equals the probability of rejecting the null hypothesis given the null is true. · Beta (β) = P(type II error) = P(Retain H0 | HA is true), which is read as the probability of a type II error equals the probability of retaining the null hypothesis given the alternative hypothesis is true.
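Because these error probabilities are conditional, they can be approximated by simulation. The sketch below is illustrative only: it assumes NumPy and SciPy are available and uses an arbitrary true effect of 0.5 standard deviations to show that repeatedly testing a true null rejects at a rate near alpha, while testing a false null rejects at a rate equal to power (1 - beta).

    # Illustrative simulation (assumes NumPy and SciPy); the 0.5 SD effect is arbitrary.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha, n, reps = 0.05, 50, 5000
    rejects_null_true = 0
    rejects_null_false = 0
    for _ in range(reps):
        a = rng.normal(0.0, 1.0, n)
        b_same = rng.normal(0.0, 1.0, n)   # H0 true: identical population means
        b_diff = rng.normal(0.5, 1.0, n)   # H0 false: true difference of 0.5 SD
        rejects_null_true += stats.ttest_ind(a, b_same).pvalue < alpha
        rejects_null_false += stats.ttest_ind(a, b_diff).pvalue < alpha

    print("estimated alpha:", rejects_null_true / reps)   # close to .05
    print("estimated power:", rejects_null_false / reps)  # equals 1 - estimated beta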
  • 31. Table 1 shows the possible outcomes for a hypothesis test.
Table 1: Possible Outcomes for a Hypothesis Test
                  True State of Nature
Decision          H0 is true          H0 is false
Retain H0         Correct decision    Type II error
Reject H0         Type I error        Correct decision
Power Analysis. Power analysis is the process of examining a test of the null hypothesis to determine the chances of rejecting it and placing belief in the alternative hypothesis. Researchers typically want to get a sense of how much statistical power they will have in their study before collecting data. In order to do so, they usually conduct a power analysis. Suppose you design a study, and a part of it is to demonstrate stereotype threat involving females. Nguyen and Ryan (2008) provide results indicating that the average Cohen's d in previous studies of gender-based stereotype threat for cognitive tests is about .21. This means that over many studies, females who are NOT made aware of a gender stereotype (NOT primed) score about 0.2 standard deviations higher on cognitive tests than females who are made aware of the stereotype (primed). To demonstrate this effect in your study, you will test the following null hypothesis: H0: μNOT primed − μprimed ≤ 0. If you reject the null, you will place your confidence in the following alternative hypothesis: HA: μNOT primed − μprimed > 0, where μNOT primed indicates the population mean for the "not primed" condition
  • 32. and μprimed indicates the population mean for the "primed" condition. The alternative hypothesis, HA: μNOT primed − μprimed > 0, specifies that the "not primed" condition will score higher than the "primed" condition. To test the null hypothesis, you would examine the distribution of the test statistic and note the area in the upper tail of the distribution equal to alpha. Suppose you plan to test this hypothesis with a t-test with 50 participants in each condition (primed or NOT primed). Figure 1 shows the sampling distribution you should expect for the values of the test statistic if the null hypothesis is true. In order to reject the null hypothesis, the t value would need to be greater than 1.66055 (see Figure 1). Because the test statistic is a continuous variable, the curve shows probability density, and probability is found by determining the area under the curve. The entire area under the curve, between −∞ and +∞, is 1.00. To find the probability of the statistic taking on a value within a certain range, you need to find the area under the curve within that range. For example, tables will tell you that the area under the curve between t = 0 and t = +1 corresponds to a probability of about .34. Most importantly, because alpha has been set equal to .05, the area beyond 1.66 corresponds to a probability of .05. Fortunately, statistical programs calculate the areas for you, and you do not need to do the calculations yourself. Nevertheless, the essence of hypothesis testing is that if you obtain a value of t greater than 1.66, you will say, "This is not a very likely event if the null is true. Thus, the null hypothesis is probably not true because the alternative hypothesis provides a more likely explanation." In making the decision to reject the null, however, you recognize that if the null is, in fact, true, you are making a type I error.
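These areas can be reproduced directly. The sketch below assumes SciPy is available and assumes the pooled degrees of freedom for two groups of 50 (50 + 50 - 2 = 98); it recovers the critical value of about 1.66055 and the area of roughly .34 between t = 0 and t = +1.

    # Minimal sketch (assumes SciPy); df = 50 + 50 - 2 = 98 is an assumption
    # based on the 50-per-group design described above.
    from scipy import stats

    df, alpha = 98, 0.05
    t_crit = stats.t.ppf(1 - alpha, df)                    # about 1.66055
    area_0_to_1 = stats.t.cdf(1, df) - stats.t.cdf(0, df)  # about 0.34
    tail_area = stats.t.sf(t_crit, df)                     # 0.05 by construction
    print(round(t_crit, 5), round(area_0_to_1, 2), round(tail_area, 3))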
  • 33. While alpha provides assurance that the researcher has a small chance of making a type I error, you are also interested in what will happen if the null hypothesis is false, which is the real-world expectation driving you to do the study. Figure 2. Now, in Figure 2, switch your focus from the curve on the left and attend to the curve on the right formed by the dashed line. This curve is based on the alternative hypothesis (i.e., that the unprimed group performs better than the primed group). To construct this curve, a specific value for the difference in means had to be specified; in this case, the value of d = .21, the overall gender effect that Nguyen and Ryan (2008) found. Note, again, that the vertical line at t = 1.66 separates the values of the test statistic that lead to rejecting versus retaining the null hypothesis, and that the line is based on the null hypothesis. The statistical power of the test, (1 - β), is the area under the dashed curve to the right of the vertical line at t = 1.66. The area designated by beta (β), to the left of the vertical line, corresponds to the probability of a type II error: retaining the null when the null is actually false. In this example, note that the area corresponding to power (1 - β) is less than the area corresponding to β; hence, you can conclude that the power is less than 0.5 because the sum of the two areas is 1.0. Almost always, you would like statistical power to be greater than beta for the important hypothesis tests in your study. In this example, a plan to do an experiment with 50 participants in each group may be doomed. The statistical power of the test (.27) is relatively low, and the risk of making a type II error is relatively high. In other words, the statistical power of the test, as currently constructed, limits your ability to detect an effect of priming versus not priming if there is one.
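The power value of .27 quoted above can be checked with a standard power routine. This is a minimal sketch that assumes the statsmodels package is available and that the test is the one-tailed, two-sample design described here (d = .21, 50 participants per group, alpha = .05).

    # Minimal sketch (assumes statsmodels): power for d = .21, n = 50 per group,
    # alpha = .05, one-tailed two-sample t-test.
    from statsmodels.stats.power import TTestIndPower

    power = TTestIndPower().power(effect_size=0.21, nobs1=50, alpha=0.05,
                                  ratio=1.0, alternative='larger')
    print(round(power, 2))  # roughly 0.27, matching the value reported above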
  • 34. Power Analysis. As the researcher, you have control of alpha, and you will set alpha when you are planning your study. Continuing with the example from the previous page, Figure 3 below shows what would happen to power if you change alpha, the probability of a type I error, to .15. Figure 3. Compare the curves in Figure 3 to the ones above in Figure 2, in which alpha (α) was equal to .05. Notice that β becomes smaller, and power, (1 - β), becomes larger. If you change α to .01, a relatively small value for the probability of a type I error, beta (β) becomes larger, and power becomes smaller; see Figure 4 below. Figure 4. · In general, making alpha (α) smaller results in a decrease in the power of the statistical test, and making alpha larger results in greater power. This is because if you set a more stringent alpha (e.g., .01 instead of .05), it becomes more difficult to reject the null hypothesis. While .05 is a typical value for α, the decision of which value to use for α is up to the researcher, and letting alpha (α) equal .05 is certainly common practice. · Many journal editors expect alpha (α) to equal .05. There are other times, however, when the researcher may wish to use a different value for alpha (α), depending on the severity of the consequences of making a type I error. For example, if you are studying whether or not a drug has serious side effects, with the null specifying that there are no serious side effects, you may
  • 35. want to have a more stringent alpha to lower your risk of saying that there aren't side effects when there actually are; you may opt for a .01 alpha instead of a .05 alpha. Power and Effect Size. A second factor that is related to the statistical power of a test is the effect size. There are several measures of effect size; with a comparison of two populations, Cohen's d is often used. The value of d is the difference in population means between two groups, in standard deviation units. According to Cohen's rule of thumb, a value of d = .2 is considered a small effect, d = .5 is considered a medium-sized effect, and d = .8 is considered a large effect. Let's revisit the earlier example about planning a study to demonstrate race-based stereotype threat. Nguyen and Ryan (2008) note that overall, race-based stereotype threat studies have resulted in an average d equal to about .32. Figure 5 below shows what you can expect if you induce a general racial stereotype threat in a rather typical way, so that in the population d = .32, with 50 participants in each group and alpha = .05. Note that power has increased noticeably compared to the study examined in Figure 2; this is because the effect size (d = .32) in this figure is larger than the effect size (d = .21) in Figure 2. Figure 5. There are instances in which stereotype effects as large as d = .64 have been identified in the samples being studied. If the population d is .64, the hypothesis test with alpha = .05 and 50 participants in each group will result in power equal to .93, as shown in Figure 6. Figure 6.
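Under the same assumptions as before (statsmodels available, one-tailed test, 50 participants per group, alpha = .05), the sketch below sweeps across the three effect sizes discussed here; power climbs from roughly .27 at d = .21 to about .93 at d = .64, consistent with Figures 2, 5, and 6. The final two lines are a hypothetical planning step, asking how many participants per group would be needed for .80 power at the small effect size (on the order of 280).

    # Sketch of how power changes with effect size (assumes statsmodels).
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    for d in (0.21, 0.32, 0.64):
        power = analysis.power(effect_size=d, nobs1=50, alpha=0.05,
                               ratio=1.0, alternative='larger')
        print(f"d = {d:.2f}: power = {power:.2f}")  # about .27 at d=.21, about .93 at d=.64

    # Hypothetical planning step: participants per group needed for 80% power at d = .21.
    n_needed = analysis.solve_power(effect_size=0.21, alpha=0.05, power=0.80,
                                    ratio=1.0, alternative='larger')
    print(round(n_needed))  # on the order of 280 per group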
  • 36. Most researchers prefer to have the estimate of power be at least .80 before they are willing to conduct a study. So planning to do a study with 50 participants in each group may be a bad decision if the effect size in the population is small or moderate, as it was above in Figures 2 and 5. On the other hand, with a large effect (e.g., d = .64), a sample of 50 participants in each condition provides more than sufficient statistical power for most researchers. The Relationship Between Power and Sample Size. Prior discussions have focused on testing hypotheses about population means, but you can also do hypothesis tests involving population proportions. In general, larger sample sizes give you more information to pin down the true nature of the population. You can, therefore, expect the sample mean and sample proportion obtained from a larger sample to be closer to the population mean and proportion, respectively. As a result, for the same level of confidence, you can report a smaller margin of error and get a narrower confidence interval. In other words, larger sample sizes increase how much you can trust your sample results. In the two scenarios below, you will see that a larger sample size results in a greater ability to reject the null when an effect actually exists in the population. Scenario: Examining Marijuana Use. Imagine you are a researcher examining marijuana use at a certain liberal arts college, and read through the scenario below. Step 1: You believe that marijuana use at the college is greater than the national average; large-scale studies have shown that about 15.7% of college students use marijuana (reported by the Harvard School of Public Health). Based on this belief, you perform the hypothesis test shown in Figure 7 below. · Note that p in this figure means the population proportion
  • 37. and p̂ means the sample proportion; the p-value, on the other hand, continues to have the same meaning as defined in the glossary. Because the p-value is greater than .05, the customary alpha level, the data do not provide enough evidence that the proportion of marijuana users at the college is higher than the proportion among all U.S. college students, which is .157. Step 2: Let's make some small changes to the above problem. Suppose that in a simple random sample of 400 students from the college, 76 admitted to marijuana use, as seen in Figure 8 below. Do the data provide enough evidence to conclude that the proportion of marijuana users among the students at the college (p) is higher than the national proportion, which is .157? Step 3: You now have a larger sample (400 instead of 100), and the number of marijuana users is 76 instead of 19. The question of interest did not change, so if you carry out the test in this case, you are testing the same hypotheses as before. Step 4: You select a random sample of size 400 and find that 76 are marijuana users, so the sample proportion is p̂ = 76/400 = .19. This is the same sample proportion as in the original problem, so it seems that the data give the same evidence. Step 5: However, when you calculate the test statistic, you see that this is not actually the case. Even though the sample proportion is the same (.19), because it is based on a larger sample (400 instead of 100), it lies 1.81 standard deviations above the null value of .157 (as opposed to .91 standard deviations in the original problem). The sampling distribution of the sample proportion has a smaller
  • 38. standard error because of the larger sample size. Step 6: The p-value here is .035, as opposed to .182 in the original problem. In other words, when H0 is true (i.e., if p = .157 at this college), it is quite unlikely (probability .035) to get a sample proportion of .19 or higher based on a sample of size 400. When the sample size is 100, a sample proportion of .19 or higher is more likely (probability .182). The results here are important. With n = 400, the data provide enough evidence to reject H0 and conclude that the proportion of marijuana users at the college is higher than among all U.S. students. With n = 100, however, the evidence is insufficient to reject the null. Figure 9 summarizes these findings. You can see that results that are based on a larger sample carry more weight. A sample proportion of .19 based on a sample of size 100 was not enough evidence that the proportion of marijuana users at the college is higher than .157. Recall that this conclusion (not having enough evidence to reject the null hypothesis) doesn't mean the null hypothesis is necessarily true; it only means that this particular study did not yield sufficient evidence to reject the null. It might be that the sample size was simply too small to detect a statistically significant difference, and a type II error was made. To summarize, you saw that when the sample proportion of .19 is obtained from a sample of size 400, it carries much more weight and, in particular, provides enough evidence that the proportion of marijuana users at the college is higher than .157 (the national figure). In this case, the sample size of 400 was large enough to detect a statistically significant difference. The following graphs show the power of the two tests if the population proportion p at the college is actually
  • 39. .19 (Figures 10 and 11). Finally, Figure 12 shows how sample size affects the test for proportions concerning marijuana use at the liberal arts college. The graph is based on a hypothesis test with alpha = .05, the proportion for the null hypothesis equal to .157, and the population proportion for the liberal arts college equal to .19. In general, whether you are testing hypotheses about proportions, means, or other parameters, the larger the sample size, the greater the statistical power. Because of your interest in rejecting the null, you need to pay attention to how large your sample size will be prior to collecting data.
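As a closing illustration, the sketch below (assuming SciPy is available) reproduces the two z-tests from the marijuana-use scenario and adds a rough normal-approximation power calculation for a true proportion of .19; the power figures it prints are illustrative estimates, not values taken from the figures above.

    # Sketch of the one-proportion z-tests and an approximate power calculation
    # (assumes SciPy). Power values are normal-approximation estimates.
    from math import sqrt
    from scipy.stats import norm

    p0, p_true, alpha = 0.157, 0.19, 0.05
    for n, users in ((100, 19), (400, 76)):
        p_hat = users / n
        se0 = sqrt(p0 * (1 - p0) / n)              # standard error under H0
        z = (p_hat - p0) / se0                     # about .91 (n=100) and 1.81 (n=400)
        p_value = norm.sf(z)                       # one-tailed: about .182 and .035
        cutoff = p0 + norm.ppf(1 - alpha) * se0    # smallest p-hat that rejects H0
        se1 = sqrt(p_true * (1 - p_true) / n)      # standard error if p really is .19
        power = norm.sf((cutoff - p_true) / se1)   # chance of rejecting when p = .19
        print(f"n={n}: z={z:.2f}, one-tailed p={p_value:.3f}, approx power={power:.2f}")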