ORGANIZATIONAL BEHAVIOR AND HUMAN DECISION PROCESSES
Vol. 69, No. 3, March, pp. 265–275, 1997
ARTICLE NO. OB972687
Positive and Negative Hypothesis Testing by Cooperative Groups
PATRICK R. LAUGHLIN, VICKI J. MAGLEY, AND ELLEN I. SHUPE
University of Illinois at Urbana-Champaign
In a rule induction problem positive hypothesis tests select evidence that the tester expects to be an example of the correct rule if the hypothesis is correct, whereas negative hypothesis tests select evidence that the tester expects to be a nonexample if the hypothesis is correct. Previous research indicates the general effectiveness of a positive test strategy for individuals, but there has been very little research with cooperative groups. We extend the analysis of Klayman and Ha (Psychological Review, 1987) of ambiguous verification or conclusive falsification of five possible types of hypotheses by positive and negative tests by emphasizing the importance of further examples following hypothesis tests. In two experiments four-person cooperative groups solved rule induction problems by proposing a hypothesis and selecting evidence to test the hypothesis on each of four arrays on each trial. In different conditions the groups were instructed to use different combinations of positive and negative tests on the four arrays. Positive tests were more likely to lead to further examples than negative tests, and the proportion of correct hypotheses corresponded to the proportion of positive tests, in both experiments. We suggest that positive tests are more effective than negative hypothesis tests in generating further evidence, and thus in inducing the correct rule, in experimental rule induction tasks with a criterion of certainty imposed by the researcher. © 1997 Academic Press

Address correspondence and reprint requests to Patrick R. Laughlin, Department of Psychology, University of Illinois, 603 E. Daniel Street, Champaign, IL 61820. E-mail: plaughli@s.psych.uiuc.edu.
0749-5978/97 $25.00
Copyright © 1997 by Academic Press
All rights of reproduction in any form reserved.

Hypothesis testing is an important area of psychological theory and research. A basic issue is the effectiveness of positive hypothesis tests and negative hypothesis tests. In a positive test the person examines or generates evidence that is expected to have the property or event of interest, whereas in a negative test the person examines or generates evidence that is not expected to have the property or event of interest.

Klayman and Ha (1987, 1989) and Klayman (1995) propose that many obtained results in research on hypothesis testing may be understood by a positive test strategy or heuristic, “a tendency to test cases that are expected (or known) to have the property of interest rather than those expected (or known) to lack that property” (1987, p. 211). They summarize evidence showing that this positive test strategy is an effective heuristic in a wide range of hypothesis testing situations, including rule learning, concept identification, judging a rule of the form “if p, then q,” learning from outcome feedback, and judgments of contingency. Distinguishing this effective positive test strategy from a deleterious “confirmation bias” that fails to falsify in the strict (e.g., Popper, 1959) or modified (e.g., Lakatos, 1970; Meehl, 1990) prescriptive sense proposed by philosophers of science, they conclude: “The appropriateness of human hypothesis-testing strategies and prescriptions about optimal strategies must be understood in terms of the interaction between the strategy and the task at hand” (p. 211).

Although hypothesis testing by cooperative groups is an important basic issue in social and organizational psychology, there has been very little research on the effectiveness of positive and negative hypothesis tests by cooperative groups. Indeed, to our knowledge only two previous experiments have explicitly assessed the effectiveness of positive and negative hypothesis tests for cooperative groups, both with a cooperative rule induction paradigm adapted from the competitive game “Eleusis” (Abbott, 1977; Gardner, 1977; Romesburg, 1979).

In Eleusis the Dealer chooses a rule based on ordinary playing cards, places an example of the correct rule face up on the table, shuffles two decks (104 cards) together, and deals each Player a hand of 14 cards. Each Player in turn plays cards which the Dealer classifies as an example or nonexample of the correct rule, placing examples face up to the right of the initial example in the order of play and nonexamples below the last card played. The objective is to get rid of all of one’s cards, either by playing examples or correctly showing the Dealer that one has no examples to play when only five cards remain in one’s hand. The Player receives two further cards for every nonexample played. A scoring
system based on the array of card plays and the remaining cards in their hands allocates points to Dealer and Players.

Both Gorman, Gorman, Latta, and Cunningham (1984) and Laughlin and Futoran (1985) converted this competitive game to cooperative group rule induction with the basic procedure of playing cards which are placed as examples and nonexamples in a progressive array of evidence chosen by the group. Gorman et al. (1984) found better performance for groups who were instructed to use negative hypothesis tests, whereas Laughlin and Futoran (1985) found that control groups who used positive and negative tests as they desired performed better than both groups who were instructed to use positive tests and groups who were instructed to use negative tests. Several procedural variations (which we consider in the General Discussion) probably account for these different results. Thus, the small amount of previous research on positive and negative hypothesis testing by cooperative groups is inconclusive.

Accordingly, the following two experiments assessed the effectiveness of positive and negative hypothesis testing for cooperative groups in rule induction problems. We first describe a simple rule induction paradigm and then describe an expanded version that allows a more comprehensive assessment of positive and negative hypothesis tests. We present the theoretical analysis of Klayman and Ha (1987) of the inferences that may be drawn from positive and negative tests of five types of hypotheses, and then extend their analysis by considering the importance of examples or nonexamples following positive and negative tests. From this analysis we predict the proportion of examples, the proportion of strategic hypotheses, and the effectiveness of positive and negative test strategies for six conditions in Experiment 1 and five conditions in Experiment 2.

The objective of the rule induction problems is to induce a correct rule based on a standard deck of 52 playing cards with four suits (clubs, C; diamonds, D; hearts, H; spades, S) of 13 cards (ace, 1; two, 2; three, 3; . . . , king, 13). The rule may be based on suit (e.g., “diamonds”), number (e.g., “eights”), or any combination of numerical and logical operations on suit and number (e.g., “even diamonds below the ten,” “even diamonds alternate with odd spades”). The problem begins with a single card that is known to be an example of the rule, placed face up on a table.

On the first trial each group member writes a hypothesis on a hypothesis sheet. The group then makes a group hypothesis and chooses one of the 52 cards. If the selected card is an example of the correct rule, the experimenter places it on the table to the right of the known example. If the selected card is not an example of the correct rule, the experimenter places it below the known example. Each group member then makes a second hypothesis, the group makes a hypothesis, and the group plays a second card. Again, example cards are placed to the right of the last example and nonexample cards below the last card in the order of play. This procedure continues for 10 trials of hypotheses and card selections, after which the group proposes a final hypothesis. The experimenter does not indicate whether the member or group hypotheses are correct or incorrect until after the final hypothesis. Table 1 gives an illustration for the correct rule “two diamonds alternate with two clubs” with the known initial example of the eight of diamonds (8D).

As in virtually all research on laboratory rule induction and rule discovery, there are three simplifying assumptions (Klayman & Ha, 1987). First, the experimenter chooses a correct rule and gives error-free feedback whether each card selection is an example or nonexample of the correct rule. Second, only one rule is correct, although other rules may be plausible or consistent with the evidence, and there is no feedback on the degree of incorrectness. Third, the correct rule requires both sufficiency and necessity. A hypothesis is nonplausible if it predicts a card will be in the set defined by the correct rule when it is not (false positive), or predicts a card will not be in the set defined by the correct rule when it is (false negative).

EXPERIMENT 1

Expanding our previous illustrative rule induction paradigm, the groups in the present experiment were

TABLE 1
Illustration of Card Plays and Hypotheses for One Array

Card plays
8D 6D 2C 8C 6D 2D
9H 2H 4C
8H
4D

Hypothesis 1: Even diamonds (after known example of 8D)
Hypothesis 2: Red (after first card play of 6D)
Hypothesis 3: Diamonds (after second card play of 9H)
Hypothesis 4: Diamonds
Hypothesis 5: Diamonds six and above and clubs
Hypothesis 6: Diamonds and clubs
Hypothesis 7: Even diamonds and even clubs
Hypothesis 8: Two red and two black alternate
Hypothesis 9: Diamonds six and above and all clubs
Hypothesis 10: Two even diamonds alternate with two even clubs
Final hypothesis: Two diamonds alternate with two clubs (after last card play of 2D)

Note. Correct rule is “two diamonds alternate with two clubs.”
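The placement procedure illustrated in Table 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not part of the original materials: the function names (`fits_rule`, `replay`) and the pattern string `"DDCC"` are hypothetical, and the order of the ten card plays is assumed, because Table 1 states only that 6D was played first, 9H second, and 2D last.

```python
def fits_rule(position, card, pattern="DDCC"):
    """True if `card` has the suit the repeating pattern requires at `position`."""
    return card[-1] == pattern[position % len(pattern)]


def replay(initial_example, plays, pattern="DDCC"):
    """Sort card plays into the row of examples and the list of nonexamples."""
    examples = [initial_example]
    nonexamples = []
    for card in plays:
        # A card is an example only if it continues the row of examples,
        # so it is checked against position len(examples) of the pattern.
        if fits_rule(len(examples), card, pattern):
            examples.append(card)
        else:
            nonexamples.append(card)
    return examples, nonexamples


# Assumed play order; Table 1 fixes only the first, second, and last plays.
plays = ["6D", "9H", "8H", "4D", "2C", "2H", "8C", "4C", "6D", "2D"]
examples, nonexamples = replay("8D", plays)
print("Examples:   ", " ".join(examples))
print("Nonexamples:", " ".join(nonexamples))
```

Under the assumed order the sketch reproduces the row of examples and the five nonexamples shown in Table 1. It also shows why the same card (e.g., 4D) can be a nonexample at one point in the sequence and an example at another.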
instructed to use positive and/or negative hypothesis tests on four separate arrays of cards on each of the ten trials. The problem began with the same known example on each of the four arrays. There were six experimental conditions. To illustrate these six conditions, assume the correct rule “two diamonds alternate with two clubs,” and the given example of the 8D on each of the four arrays. Assume that the first group hypothesis is “even diamonds.” In the PPPP Condition the groups were instructed to use a positive test (P) of their current hypothesis on each of the four arrays. After playing one card on each of the four arrays they were given feedback whether each of the four cards was an example or nonexample. Hence possible card plays and resulting feedback on the first trial would be:

Array 1        Array 2
8D 6D          8D 4D
Array 3        Array 4
8D 2D          8D QD

In the PPPN Condition the groups were instructed to use positive tests on Arrays 1, 2, and 3 and a negative test (N) on Array 4. Hence possible card plays and resulting feedback would be:

Array 1        Array 2
8D 6D          8D 4D
Array 3        Array 4
8D 2D          8D 7D

In the PPNN Condition the groups were instructed to use positive tests on Arrays 1 and 2 and negative tests on Arrays 3 and 4, such as:

Array 1        Array 2
8D 6D          8D 4D
Array 3        Array 4
8D             8D 7D
8S

In the PNNN Condition the groups were instructed to use a positive test on Array 1 and negative tests on Arrays 2, 3, and 4, such as:

Array 1        Array 2
8D 6D          8D
               8H
Array 3        Array 4
8D             8D 7D
8S

In the NNNN Condition the groups were instructed to use negative tests on all four arrays, such as:

Array 1        Array 2
8D             8D
3C             8H
Array 3        Array 4
8D             8D 7D
8S

In the Control Condition there were no instructions to use positive or negative hypothesis tests, so the groups could use any combination of positive and negative tests on the four arrays.

Similarly, on each of the second and subsequent trials the groups proposed a hypothesis and then used positive or negative hypothesis tests on the four arrays as instructed in the first five conditions, or as they wished in the Control Condition.

Will all positive hypothesis tests (PPPP Condition), all negative hypothesis tests (NNNN Condition), fixed proportions of positive and negative tests (PPPN, PPNN, and PNNN Conditions) or unconstrained positive and negative tests (Control Condition) result in more effective performance with this expanded rule induction paradigm? In an excellent theoretical analysis, Klayman and Ha (1987) discuss the five possible types of hypotheses and the inferences that may be made from the results of positive tests and negative tests of each type. Although they illustrate their analysis with the Wason (1960) 2-4-6 Task, it generalizes (with one exception) to other rule induction paradigms. To illustrate their analysis, assume the hypotheses and card plays of Table 1 and the correct answer “two diamonds alternate with two clubs.”

Embedded hypotheses are based on the appropriate relationships but are too specific, such as Hypothesis 10, “two even diamonds alternate with two even clubs.” Overlapping hypotheses are plausible but based on other relationships, such as Hypothesis 1, “even diamonds.” Surrounding hypotheses are based on the appropriate relationships but are too general, such as Hypothesis 8, “two red and two black alternate.” Disjoint (nonplausible) hypotheses are inconsistent with the evidence, such as Hypothesis 6, “diamonds and clubs.” Target (correct) hypotheses are the correct answer chosen by the experimenter, such as the Final Hypothesis, “two diamonds alternate with two clubs.”

Klayman and Ha (1987) analyze the inferences of ambiguous verification or conclusive falsification that may be drawn from the Type of Test (Positive or Negative) and Results (Yes, in the Target Set T, or No, not in the Target Set T) for the five types of hypotheses in five 2 × 2 figures. In the current rule induction problem the result “Yes, in the Target Set” is an example card, whereas the result “No, not in the Target Set” is a
nonexample card. We combine their five figures in Table 2.

Embedded hypotheses recognize the correct relationships but are too specific. As indicated in Table 2, an example following a positive test of an embedded hypothesis ambiguously verifies the hypothesis, whereas a nonexample is impossible. An example following a negative test of an embedded hypothesis conclusively falsifies the hypothesis, whereas a nonexample ambiguously verifies the hypothesis. Hence negative tests of embedded hypotheses should be more effective because they may conclusively falsify hypotheses that are too specific.

Overlapping hypotheses are plausible but based on other relationships than those of the correct rule. An example following a positive test of an overlapping hypothesis ambiguously verifies the hypothesis, whereas a nonexample conclusively falsifies it. An example following a negative test conclusively falsifies the hypothesis, whereas a nonexample ambiguously verifies it. Hence, positive and negative tests of overlapping hypotheses should be equally effective.

Surrounding hypotheses recognize the correct relationships but are too general. An example following a positive test of a surrounding hypothesis ambiguously verifies the hypothesis, whereas a nonexample conclusively falsifies it. An example is impossible following a negative test, and the nonexample ambiguously verifies the hypothesis. Hence, positive tests of surrounding hypotheses should be more effective because they may conclusively falsify hypotheses that are too general.

Disjoint (nonplausible) hypotheses are inconsistent with the evidence. Although it is somewhat paradoxical to consider the inferences that may be drawn from positive and negative tests of hypotheses that contradict the available evidence, a positive test followed by a nonexample, or a negative test followed by an example, will conclusively falsify the nonplausible hypothesis. In contrast to the Wason 2-4-6 task, where an example following a positive test of a nonplausible hypothesis is impossible, in the current rule induction task an example may follow a positive test of a nonplausible hypothesis. To illustrate, assume the correct rule “two diamonds alternate with two clubs” and the known initial example 8D. A positive test of the nonplausible hypothesis “odd diamonds” on the first trial with the 7D results in an example of the correct rule. Hence, positive and negative tests of nonplausible hypotheses should be equally effective.

A positive test of the correct hypothesis will necessarily be followed by an example, ambiguously verifying the hypothesis, and a negative test will necessarily be followed by a nonexample, also ambiguously verifying the hypothesis. Hence, positive and negative tests should be equally effective.

Extending these inferences of conclusive falsification and ambiguous verification from positive and negative tests of the five types of hypotheses, we now emphasize the importance of the resulting examples or nonexamples. Examples provide further evidence for what the correct rule is, whereas nonexamples indicate what the correct rule is not. This further evidence should make the correct relationships more likely to be perceived and tested. Hence we conjecture that further examples will be more likely to lead to induction of the correct rule than further nonexamples.

Positive tests of embedded and correct hypotheses must necessarily result in examples, and negative tests of surrounding hypotheses must necessarily result in nonexamples. Beyond this, we conjecture that positive tests of overlapping hypotheses are more likely to result in further examples than negative hypothesis tests in the current rule induction problems. The problems begin from minimal information, a single example of the correct rule, such as the 8D of the illustration in Table 1. The correct rules involve patterns of evidence that are not apparent until a number of example cards have been played. Since overlapping hypotheses share examples with the correct hypothesis by definition, positive tests should be more likely to result in further examples than negative tests, which should be less likely to share examples with the correct hypothesis. In particular, on

TABLE 2
Inferences from Positive and Negative Tests of Five Types of Hypotheses on the Wason (1960) 2-4-6 Task (Klayman & Ha, 1987)

                       Result
Hypothesis and test    Example                      Nonexample
Embedded
  Positive             Ambiguous verification       Impossible
  Negative             Conclusive falsification     Ambiguous verification
Overlapping
  Positive             Ambiguous verification       Conclusive falsification
  Negative             Conclusive falsification     Ambiguous verification
Surrounding
  Positive             Ambiguous verification       Conclusive falsification
  Negative             Impossible                   Ambiguous verification
Nonplausible
  Positive             Impossible for 2-4-6 task    Conclusive falsification
  Negative             Conclusive falsification     Ambiguous verification
Correct
  Positive             Ambiguous verification       Impossible
  Negative             Impossible                   Ambiguous verification

Note. In the current rule induction task an example may follow a positive test of a nonplausible hypothesis. To illustrate, assume the correct rule “two diamonds alternate with two clubs” and the known initial example 8D. A positive test of the nonplausible hypothesis “odd diamonds” with the 7D results in an example of the correct rule.
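The cells of Table 2 follow from the set relation between a hypothesis and the target, and this can be checked with a short sketch. The sketch below makes a simplifying assumption that differs from the present experiments: rules are treated as static sets of cards (e.g., “diamonds”), as in the Wason 2-4-6 analysis, not as sequential alternation rules. All names (`cards`, `classify`, `inference_table`) are hypothetical and introduced only for illustration.

```python
SUITS = "CDHS"
DECK = frozenset((rank, suit) for rank in range(1, 14) for suit in SUITS)


def cards(predicate):
    """Set of cards in the deck that satisfy predicate(rank, suit)."""
    return frozenset(card for card in DECK if predicate(*card))


def classify(hypothesis, target):
    """Name the set relation between a hypothesis and the target rule."""
    if hypothesis == target:
        return "correct"
    if hypothesis < target:      # proper subset: too specific
        return "embedded"
    if hypothesis > target:      # proper superset: too general
        return "surrounding"
    if hypothesis & target:
        return "overlapping"
    return "disjoint"


def inference_table(hypothesis, target):
    """Inference for each combination of test type and result."""
    table = {}
    tests = {"positive": hypothesis, "negative": DECK - hypothesis}
    results = {"example": target, "nonexample": DECK - target}
    for test, tested in tests.items():
        for result, outcome in results.items():
            if not tested & outcome:
                table[test, result] = "impossible"
            elif (test == "positive") == (result == "example"):
                table[test, result] = "ambiguous verification"
            else:
                table[test, result] = "conclusive falsification"
    return table


diamonds = cards(lambda rank, suit: suit == "D")            # assumed target
even_diamonds = cards(lambda rank, suit: suit == "D" and rank % 2 == 0)
red = cards(lambda rank, suit: suit in "DH")
even = cards(lambda rank, suit: rank % 2 == 0)
clubs = cards(lambda rank, suit: suit == "C")

for name, hyp in [("even diamonds", even_diamonds), ("even", even),
                  ("red", red), ("clubs", clubs), ("diamonds", diamonds)]:
    print(name, classify(hyp, diamonds), inference_table(hyp, diamonds))
```

For static sets the disjoint case reproduces the “Impossible for 2-4-6 task” cell. As the Note to Table 2 points out, this cell does not carry over to the sequential card rules used here, which the sketch does not model.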
the early trials of the rule induction problem virtually all hypotheses should be overlapping hypotheses that are consistent with the evidence but based on other relationships than those of the correct rule. Many of these overlapping hypotheses should be based on other relationships, such as “diamonds,” “even diamonds,” or “diamonds eight and below” for which a positive test will result in a further example.

Thus, if positive tests are more likely to lead to further examples than negative tests, and examples are more useful than nonexamples in inducing the correct rule because they provide further evidence, the number of correct hypotheses should correspond to the proportion of positive tests on the four arrays.

These considerations lead to an interesting question. How may the groups in the NNNN Condition who are constrained to use all negative tests obtain examples? Assume that a group believes the correct rule is “two diamonds alternate with two clubs” after the sequence of examples 8D 2D 2C 8C on one of the four arrays. They wish to obtain a further example of this rule, which would be a diamond if their hypothesis is correct, but are constrained by instructions to conduct a negative test by playing a nondiamond. They may propose the hypothesis “two diamonds alternate with two clubs alternate with two hearts” and conduct a negative test of it by playing a diamond, which will be an example if their actual preferred hypothesis “two diamonds alternate with two clubs” is correct. By analogy to social choice theory (e.g., Sen, 1970), in which an individual or group may vote against their true preference order to achieve their objective, we call such hypotheses strategic hypotheses.

In summary, these considerations lead to three predictions:

PREDICTION 1. There will be a higher total proportion of examples following positive hypothesis tests than negative hypothesis tests.
PREDICTION 2. There will be more strategic hypotheses for NNNN than each of PPPP, PPPN, PPNN, and PNNN, which will not differ significantly from each other.
PREDICTION 3. The order of total correct hypotheses will be PPPP > PPPN > PPNN > PNNN > NNNN.

Method

The subjects were 480 students in introductory psychology courses at the University of Illinois at Urbana-Champaign who participated in partial fulfillment of course requirements. They were randomly assigned to 20 four-person groups in each of the six between-subjects conditions.

The correct rules were the 12 possible alternations of doubles of diamonds (D), clubs (C), hearts (H), and spades (S): DDCC, DDHH, DDSS, CCDD, CCHH, CCSS, HHDD, HHCC, HHSS, SSDD, SSCC, and SSHH. There were two replications of the first eight rules and one replication of the last four rules in each of the six Conditions. Depending upon the correct rule, the initial example card was the 8D, 8C, 8H, or 8S. The basic instructions were as follows:

This is an experiment in problem solving. The objective is to figure out a correct rule based on playing cards. Aces have the value 1, deuces 2, and so on to tens 10, jacks 11, queens 12, and kings 13. The rule may be based on any characteristics of the cards, including suit, number, numerical and logical operations, alternation, and so on. For example, if the rule were “diamonds,” all diamonds would fit the rule, and all hearts, clubs, and spades would not fit the rule. I will start you with one card that does fit the rule. The first step will be for each of you to write your own hypothesis on your individual hypothesis sheet. Then the four of you will decide on a group hypothesis, which one of you will write on the group hypothesis sheet (the group recorder was randomly designated by a roll of a die). Then you will play any one of the 52 cards you choose on each of the four arrays. After you choose a card for all four arrays, I will tell you whether or not each card also fits the rule. If the card you play also fits the rule, I will place it to the right of the first card. If the card does not fit the rule, I will place it below the first card. Then you will each make your second individual hypothesis, make your second group hypothesis, and play a second card on each array. If this second card fits the rule, I will place it to the right of the last card that fits the rule and if it does not fit the rule, I will place it below the last card played. This procedure will continue for 10 trials of individual hypotheses, group hypothesis, and group card play. After the 10 trials you will make your final individual hypotheses and your final group hypothesis. I will not say whether or not your first ten hypotheses are correct, but I will tell you whether or not your final hypothesis is correct at the end of the experiment.

The experimenter then demonstrated this procedure for four example rules: “diamonds,” “even diamonds,” “even diamonds or clubs above the six,” and “odd spades alternate with even hearts.” Depending upon the condition (PPPP, etc.) the experimenter next explained the procedure of positive or negative tests of the current group hypothesis on each of the four arrays. Within the constraints of positive or negative tests, the card plays could be the same or different on each trial. The experimenter monitored each card selection to assure that it was a positive test or negative test of the current group hypothesis as appropriate for the condition. There was no mention of positive and negative tests in the Control Condition.

Discussion was completely free within the groups. No group decision rule (e.g., unanimity, majority) for hypotheses or card plays was imposed or implied by the instructions. Several decks of cards (sorted by suits and arranged in ascending order from the ace to the king) were available, so the same card could be played as many times as desired on different arrays and trials.

The experimenter recorded the trial number of each hypothesis judged to be strategic. After the problem was
completed, the experimenter explained the meaning of strategic hypotheses and asked the group to indicate which hypotheses were strategic on their group hypothesis sheet. The experimenter then told the subjects the correct rule, gave them an oral summary of the purposes of the research, answered any questions, and thanked them for their participation.

Results and Discussion

Proportion of examples. Table 3 gives the mean proportion of examples for each of the four arrays for the six experimental conditions. A 6(Conditions) × 4(Arrays) analysis of variance with repeated measures on the second factor indicated a significant main effect of Conditions, F(5, 114) = 29.60, p < .001, MSe = 6.63, a significant main effect of Arrays, F(3, 342) = 58.70, p < .001, MSe = 2.06, and a significant Conditions × Arrays interaction, F(3, 342) = 22.94, p < .001.

All four simple main effects of Conditions for Arrays were significant, F(5, 114) = 10.97, p < .001 for Array 1; F(5, 114) = 34.45, p < .001 for Array 2; F(5, 114) = 61.39, p < .001 for Array 3; and F(5, 114) = 59.04, p < .001 for Array 4. Newman–Keuls comparisons were then conducted within the simple main effects of Conditions for Arrays. Inspection of the patterns of significant differences within each Array in Table 3 indicates the predicted greater probabilities of examples for arrays with instructions to use positive tests than for arrays with instructions to use negative tests.

Although positive tests of embedded and correct hypotheses must necessarily result in examples, positive tests of overlapping hypotheses may result in examples or nonexamples. There was a higher proportion of examples for positive tests of overlapping hypotheses (.48) than negative tests (.30), χ²(1) = 74.32, p < .001.

In summary, these results support Prediction 1 that positive hypothesis tests will be more likely to be followed by an example than negative hypothesis tests.

Strategic hypotheses. The group members had no difficulty understanding the meaning of strategic hypotheses, and the experimenter and group judgments agreed on 97% of the hypotheses.

Figure 1 gives the proportions of strategic hypotheses for blocks of two trials for the five instruction conditions (strategic hypotheses do not make sense for the Control Condition). As is evident in Fig. 1, there were considerably more strategic hypotheses in the NNNN Condition than the other four conditions. The overall proportions of strategic hypotheses were .16 for PPPP, .10 for PPPN, .05 for PPNN, .21 for PNNN, and .59 for NNNN. There was a significant main effect of Condition, F(4, 95) = 21.59, p < .001, MSe = 4.24. Newman–Keuls comparisons indicated more strategic hypotheses for NNNN than each of the other four conditions (all p < .001), which did not differ from each other except for more strategic hypotheses for PNNN than PPNN, p < .05. This supports Prediction 2.

If the NNNN groups proposed strategic hypotheses in order to obtain further examples we would expect the conditional probability of an example given a strategic hypothesis to be greater than the conditional probability of an example given a nonstrategic hypothesis. These respective conditional probabilities were .57 and .24, χ²(1) = 85.34, p < .001. The use of strategic hypotheses by these NNNN groups is evidence that they realized the value of further examples in inducing the correct rule.

Five types of hypotheses. Figure 2 gives the proportions of embedded, overlapping, surrounding, nonplausible, and correct hypotheses for the 11 trials over the six conditions. As evident in Fig. 2, overlapping hypotheses predominated on the early trials, supporting our assumption, and there were relatively few embedded and surrounding hypotheses. Figure 3 gives the proportions of embedded, overlapping, surrounding, nonplausible, and correct hypotheses for each of the six conditions over the 11 trials.

FIG. 1. Proportions of strategic hypotheses for blocks of two trials: Experiment 1.

TABLE 3
Mean Proportion of Examples: Experiment 1

           Condition
           Cont    PPPP    PPPN    PPNN    PNNN    NNNN
Array 1    .65bc   .71ab   .74a    .62c    .57c    .45
Array 2    .65b    .75a    .75a    .64b    .28     .45
Array 3    .68a    .74a    .75a    .17     .29     .46
Array 4    .67a    .70a    .19bc   .13c    .26b    .43

Note. Within each row means without a common subscript differ significantly by Newman–Keuls comparisons.
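The pattern in Table 3 can be summarized descriptively by grouping the cell means according to the instruction for each array. The sketch below is only an illustration and is not an analysis reported in the article: it takes the unweighted average of the tabled cell means for the five instruction conditions, and the names `TABLE_3` and `means_by_instruction` are hypothetical.

```python
from statistics import mean

# Cell means from Table 3 (Arrays 1-4) for the five instruction conditions.
# The condition label gives the instruction for each array in order.
TABLE_3 = {
    "PPPP": [.71, .75, .74, .70],
    "PPPN": [.74, .75, .75, .19],
    "PPNN": [.62, .64, .17, .13],
    "PNNN": [.57, .28, .29, .26],
    "NNNN": [.45, .45, .46, .43],
}


def means_by_instruction(table):
    """Unweighted mean proportion of examples for P arrays and N arrays."""
    cells = {"P": [], "N": []}
    for condition, proportions in table.items():
        for instruction, proportion in zip(condition, proportions):
            cells[instruction].append(proportion)
    return {instruction: mean(values) for instruction, values in cells.items()}


summary = means_by_instruction(TABLE_3)
print({instruction: round(value, 3) for instruction, value in summary.items()})
```

Averaged this way, arrays with positive-test instructions have a mean proportion of examples of about .70 and arrays with negative-test instructions about .31. The negative-test cells are highest in the NNNN Condition (.43 to .46), where strategic hypotheses were most frequent.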
FIG. 2. Proportions of embedded, overlapping, surrounding, nonplausible, and correct hypotheses for 11 trials: Experiment 1.

Total correct hypotheses. As indicated in Fig. 3, the proportion of correct hypotheses was .45 for PPPP, .52 for PPPN, .41 for PPNN, .35 for PNNN, .16 for NNNN, and .52 for Control. This corresponded to the predicted order of PPPP > PPPN > PPNN > PNNN > NNNN, with the reversal of PPPP and PPPN.

The main effect of Condition for the proportions of total correct hypotheses was significant, F(5, 114) = 8.72, MSe = .042, p < .001. Newman–Keuls comparisons indicated a higher proportion of correct hypotheses for each of Control, PPPP, PPPN, PPNN, and PNNN than NNNN (all p < .001 except PNNN p < .01), indicating better performance if the groups were instructed or allowed to use positive hypothesis tests on at least one array. There was a higher proportion of correct hypotheses for both Control and PPPN than PNNN (both p < .01). There was no significant difference between Control, PPPP, PPPN, and PPNN, indicating comparable performance for groups who were instructed to use positive hypothesis tests on at least two arrays and the Control Condition. These results generally support the predicted order, but the groups who were instructed to use positive tests on at least two arrays (PPPP, PPPN, and PPNN) did not differ significantly from each other.

EXPERIMENT 2

Although the order of correct hypotheses for the five instruction conditions in Experiment 1 was as predicted with the reversal of PPPP and PPPN, instructions to use positive tests on at least two arrays resulted in comparable proportions of correct hypotheses. Similarly, the Control groups who used positive and negative tests as they preferred performed at the level of groups instructed to use positive tests on at least two arrays. One possible reason for this is that the problems were relatively easy with the large amount of information available from four arrays of card selections. Although the number of examples, and hence the amount of evidence, increased with positive tests, there was sufficient information with the examples from positive tests on two arrays. Accordingly, Experiment 2 used more difficult rules, so that increasing numbers of positive tests, and hence increasing numbers of examples, should result in better performance.

The correct rules were alternations of triples of two different suits, such as “three diamonds alternate with three clubs.” We expected these rules to be considerably more difficult than the alternations of doubles of suits (e.g., “two diamonds alternate with two clubs”) of Experiment 1, and therefore we expected positive hypothesis tests to be more effective than negative hypothesis tests.

As in Experiment 1, there were four arrays and 10 trials of group member hypotheses, group hypothesis, and card selections. There were five conditions of instructions to use positive tests (P) or negative tests (N) on the first five trials and the second five trials: (1) positive tests on the first five and positive tests on the second five (PP), (2) positive tests on the first five and negative tests on the second five (PN), (3) negative tests on the first five and positive tests on the second five (NP), (4) negative tests on the first five and negative tests on the second five (NN), and (5) no instructions to use positive or negative tests (Control). These instructions assured that the PP groups would have twice as many positive tests as the PN and NP groups, and the NN groups would have no positive tests, thus providing a relatively greater difference in positive tests than the PPPP, PPPN, PPNN, PNNN, and NNNN Conditions of Experiment 1.

From the considerations in the Introduction we made three predictions:

FIG. 3. Proportions of embedded, overlapping, surrounding, nonplausible, and correct hypotheses for six conditions: Experiment 1.
PREDICTION 1. There will be a higher proportion of examples following positive hypothesis tests than negative hypothesis tests.

PREDICTION 2. The order of strategic hypotheses on the first five trials will be: (NP and NN) > (PP and PN). The order of strategic hypotheses on the second five trials will be: (PN and NN) > (PP and NP).

PREDICTION 3. The order of total correct hypotheses will be: PP > (PN = NP) > NN.

Method

The subjects were 240 students in introductory psychology courses at the University of Illinois at Urbana-Champaign who participated in partial fulfillment of course requirements. There were 12 replications in each of the five experimental conditions.

The correct rules were the 12 possible alternations of triples of two different suits, such as “three diamonds alternate with three clubs,” and “three diamonds alternate with three hearts.” Each of the 12 rules was used for one replication of the five conditions. The general instructions and procedures were the same as in Experiment 1, with appropriate modifications for the different instructions to use positive or negative hypothesis tests on the first five trials and second five trials in the PP, PN, NP, and NN Conditions.

Results and Discussion

Proportion of examples. Figure 4 gives the mean proportion of examples for the first five and second five trials for the five conditions. A 5(condition) × 2(blocks of five trials) ANOVA indicated a significant main effect of Conditions, F(4, 55) = 4.92, p < .002, MSe = 12.41. There was a significant effect of trial blocks, F(1, 55) = 16.40, p < .001, MSe = 9.82, and a significant Condition × Trial Blocks interaction, F(4, 55) = 9.20, p < .001. Both of the simple main effects of Conditions for the First Block of Trials and Conditions for the Second Block of Trials were significant, F(4, 55) = 3.12, p < .05; F(4, 55) = 12.30, p < .001, respectively. Newman–Keuls tests within the simple main effect of Conditions for the First Block of Trials indicated more examples for each of PP, PN, and Control than each of NN and NP, all p < .05. The PP, PN, and Control Conditions did not differ significantly from each other. As predicted, there were more examples on the first five trials for the PP and PN Conditions who were instructed to use positive tests than the NP and NN Conditions who were instructed to use negative tests.

Newman–Keuls comparisons within the simple main effect of Conditions for the Second Block of Trials indicated fewer examples for PN than each of the other four conditions, all p < .001. There were more examples for PP than NP, p < .05. As predicted, there were more examples on the second five trials for the PP and NP Conditions that were instructed to use positive tests than the PN Condition. Contrary to prediction, there were not more examples on the second five trials for the PP and NP Conditions that were instructed to use positive tests than the NN Condition, which we interpret as the effectiveness of using strategic hypotheses to obtain examples in the NN Condition.

FIG. 4. Proportions of examples for first five trials and second five trials: Experiment 2.

FIG. 5. Proportions of strategic hypotheses for first five trials and second five trials: Experiment 2.

Strategic hypotheses. Figure 5 gives the proportion of strategic hypotheses for the first five and second five trials for the four instruction conditions. A 4(condition) × 2(blocks of five trials) ANOVA indicated a significant main effect of Conditions, F(3, 44) = 12.17, p < .001, MSe = 1.8816, a significant effect of Blocks, F(1, 44) =
43.52, p < .001, MSe = 1.2756, and a significant Condition × Blocks interaction, F(3, 44) = 7.54, p < .01. A planned contrast for the First Block of Trials indicated more strategic hypotheses for (NP and NN) than (PP and PN), F(1, 44) = 10.76, p < .001, supporting the first part of Prediction 2. A planned contrast for the Second Block of Trials indicated more strategic hypotheses for (PN and NN) than (PP and NP), F(1, 44) = 17.51, p < .001, supporting the second part of Prediction 2.

Total correct hypotheses. Figure 6 gives the proportions of embedded, overlapping, surrounding, nonplausible, and correct hypotheses for the five conditions. As indicated in Fig. 6, the proportions of correct hypotheses were .21 for PP, .11 for PN, .20 for NP, and .08 for NN, supporting the predicted order. The main effect of Conditions was significant, F(4, 55) = 3.95, p < .01. As predicted, Newman–Keuls comparisons indicated a higher proportion of correct hypotheses for PP than each of PN and NN, and a higher proportion for NP than NN, all p < .05, and a nonsignificant difference between PN and NP. Contrary to prediction, there was not a significant difference between PP and NP. The Controls had a significantly higher proportion of correct hypotheses than each of PN, p < .05, and NN, p < .01, and did not differ from PP and NP.

FIG. 6. Proportions of embedded, overlapping, surrounding, nonplausible, and correct hypotheses for five conditions: Experiment 2.

GENERAL DISCUSSION

Klayman and Ha (1987) analyzed the inferences of conclusive falsification and ambiguous verification of hypotheses that may be drawn from positive and negative tests of embedded, overlapping, surrounding, disjoint (nonplausible), and correct hypotheses. Extending their analysis, we emphasized the importance of the probability of a further example following a positive or negative test of the five types of hypotheses, as examples provide further evidence to induce the correct rule whereas nonexamples indicate what it is not. We conjectured that positive tests would be more likely to result in examples than negative tests, and this was supported in both experiments. Assuming this importance of examples and the greater probability of examples following positive tests, we predicted that groups who were constrained to use all negative tests would use strategic hypotheses. This prediction was supported in both experiments, indicating that these groups realized the importance of examples.

Our analysis led to the prediction that the order of total correct hypotheses for the instructions to use positive hypothesis tests (P) or negative hypothesis tests (N) on the four arrays would be PPPP > PPPN > PPNN > PNNN > NNNN in Experiment 1. The proportion of correct hypotheses followed this order, but groups who were instructed to use at least two positive tests and the uninstructed Control Condition performed comparably. Since this may have been due to relatively easy rules and the large amount of information from four arrays, we used more difficult rules in Experiment 2. The predicted order of correct hypotheses of PP > (PN = NP) > NN was supported, although there was not a significant difference between PP and NP.

We interpret both experiments as extending previous evidence for the effectiveness of a positive test strategy or heuristic on the Wason 2-4-6 task, judgments of if-then relationships and covariance, and concept attainment (for reviews, see Klayman, 1995; Klayman & Ha, 1987, 1989). Unlike the Wason (1960) 2-4-6 task, an obvious hypothesis such as “increasing by two” was not embedded within the more general correct hypothesis “increasing numbers,” in the given evidence, so that a negative test strategy was not necessarily effective a priori. Moreover, in contrast to the 2-4-6 task, there were very few embedded hypotheses in either experiment. In contrast to the Wason task where each triple generated by the problem solver is an independent test, in the current paradigm each card play is added to a progressive array of evidence, providing a closer analog to the development of evidence in domains of hypothesis testing outside of laboratory experiments.

This progressive development of evidence over successive trials of hypothesis testing extends typical research on judgments of if-then relationships and judgments of covariance, where the fixed amount of evidence is prearranged and presented by the experimenter. In contrast to research on concept attainment with the paradigm of Bruner, Goodnow, and Austin (1956), there was an indeterminate rather than determinate number of initially possible correct hypotheses, so that a single
correct hypothesis could not be established with certainty by a series of hypothesis tests that eliminate all but one possibility. Again, this is a more realistic analog of hypothesis testing in domains outside of laboratory rule induction and rule discovery.

In this rule induction paradigm there are two types of correct rules, contingent and noncontingent. With contingent rules a given card may be an example or a nonexample depending upon the order of play, and with noncontingent rules a given card is either an example or a nonexample regardless of the order of play. To illustrate, with the contingent rule “two diamonds alternate with two clubs” a diamond following one diamond is an example, but a diamond following two diamonds is a nonexample. With the noncontingent rule “diamonds and clubs” a diamond is an example regardless of the order of play. Both experiments used the contingent rules of patterns of alternation of suits because they are a more realistic analog than noncontingent rules of inductive domains where generalizations, rules, and principles become progressively apparent as evidence progressively develops.

The results also extend the two previous studies of instructions to use positive and negative tests for cooperative groups. Consistent with the current results, Laughlin and Futoran (1985) found that uninstructed Controls had more correct hypotheses than groups instructed to use all negative tests. In contrast to the current results, Gorman et al. (1984) found better performance for groups who were instructed to use negative hypothesis tests. Differences in the experimental procedures probably account for these different results. As in Eleusis, the groups in the Gorman et al. experiment started with a limited number of cards and were given two cards for each nonexample they played, whereas the groups in the current and Laughlin and Futoran experiments had an unlimited number of decks of cards to play one card per array on each of 10 trials and were not given further cards for playing nonexamples. Hence the groups in the Gorman et al. experiment may have had an additional incentive to seek nonexamples in order to get more cards to play and obtain further information. The four experiments are consistent in demonstrating the importance of obtaining further evidence to induce the correct rule.

How do these results for the descriptive effectiveness of positive tests and the previously demonstrated effectiveness of a positive test strategy in other laboratory hypothesis testing tasks (Klayman, 1995; Klayman & Ha, 1987, 1989) relate to the prescriptive falsification of philosophers of science? First, consider the amount of evidence in well-developed sciences and laboratory hypothesis testing tasks. Popper (1959, 1972) proposed that the criterion of scientific as opposed to nonscientific theory is falsifiability, and that scientific experiments should therefore be designed to attempt to falsify rather than to support prevailing theory. His analysis was based on mature sciences such as theoretical physics. Such sciences have passed through a natural history phase in which scientists reach general agreement on the phenomena of interest, appropriate terminology, permissible operations and procedures, and the boundary of the domain. Given this agreement, a large amount of accepted evidence exists to be explained by well-developed competing theories. Experiments may usefully be designed to falsify these approximately correct competing theories. In contrast, the rule induction task begins with the minimal evidence of the single known example of the correct rule, and further evidence is therefore of relatively more importance than in the evidence-rich mature sciences which have reached the theory testing stage.

Second, consider the criterion of certainty in laboratory hypothesis testing tasks and well-developed sciences. A correct rule exists in laboratory hypothesis testing tasks because it has been chosen by the experimenter who gives unambiguous error-free feedback indicating whether positive and negative hypothesis tests are followed by further examples or nonexamples. In contrast, no Omniscient Experimenter chooses correct hypotheses and provides unambiguous error-free feedback in scientific research, auditing, or the other situations in which people test hypotheses in the search for generalizations, rules, and principles. Thus, the prescriptive falsification proposed by philosophers of science applies to a less certain criterion than the certain criterion imposed by the experimenter in laboratory hypothesis testing tasks.

REFERENCES

Abbott, R. (1977). The new Eleusis. New York: Author.
Bruner, J. S., Goodnow, J. J., & Austin, G. A. (1956). A study of thinking. New York: Wiley.
Gardner, M. (1977). Mathematical games. Scientific American, 237(4), 18–25.
Gorman, M. E., Gorman, M. E., Latta, M., & Cunningham, G. (1984). How disconfirmatory, confirmatory, and combined strategies affect group problem solving. British Journal of Psychology, 75, 65–79.
Klayman, J. (1995). Varieties of confirmation bias. In J. R. Busemeyer, R. Hastie, & D. L. Medin (Eds.), Decision making from the perspective of cognitive psychology (pp. 385–418). New York: Academic Press.
Klayman, J., & Ha, Y-M. (1987). Confirmation, disconfirmation, and information in hypothesis testing. Psychological Review, 94, 211–228.
Klayman, J., & Ha, Y-M. (1989). Hypothesis testing in rule discovery: Strategy, structure, and content. Journal of Experimental Psychology: Learning, Memory, and Cognition, 15, 596–604.
Lakatos, I. (1970). Falsification and methodology of scientific research programmes. In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of scientific knowledge (pp. 91–196). Amsterdam: North Holland.
Laughlin, P. R., & Futoran, G. C. (1985). Collective induction: Social combination and sequential transition. Journal of Personality and Social Psychology, 48, 608–613.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108–141.
Popper, K. R. (1959). The logic of scientific discovery. New York: Basic Books.
Popper, K. R. (1972). Objective knowledge. Oxford, England: Clarendon.
Romesburg, H. C. (1979). Simulating scientific inquiry with the card game Eleusis. Science Education, 63, 599–608.
Sen, A. K. (1970). Collective choice and individual values. New York: Holden-Day.
Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12, 129–140.

Received: July 22, 1996
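
Illustrative note. The distinction between contingent and noncontingent rules drawn in the General Discussion can be sketched in a few lines of Python. This is a minimal illustration and not part of the experimental materials. The function names, the single-letter suit codes ("D", "C", "H", "S"), and the assumption that the first known example begins a run of its suit are all hypothetical choices made for this sketch. Under those assumptions, the contingent check reproduces the illustration in the text: with "two diamonds alternate with two clubs," a diamond following one diamond is an example, and a diamond following two diamonds is a nonexample.

```python
from typing import Sequence


def trailing_run(examples: Sequence[str]) -> int:
    """Length of the run of identical suits at the end of the example array."""
    run = 0
    for suit in reversed(examples):
        if suit != examples[-1]:
            break
        run += 1
    return run


def is_example_contingent(examples: Sequence[str], card: str,
                          suit_a: str = "D", suit_b: str = "C",
                          n: int = 2) -> bool:
    """Contingent rule: n cards of suit_a alternate with n cards of suit_b.

    Assumption (not from the paper): the array of prior examples is valid
    under the rule, and its first card starts a run of its suit.
    """
    if not examples:
        raise ValueError("the array must contain at least one known example")
    if card not in (suit_a, suit_b):
        return False
    last = examples[-1]
    if trailing_run(examples) < n:
        return card == last   # the current run must continue
    return card != last       # the run is complete, so the other suit follows


def is_example_noncontingent(card: str,
                             suits: Sequence[str] = ("D", "C")) -> bool:
    """Noncontingent rule such as "diamonds and clubs": order of play is ignored."""
    return card in suits
```

Setting n=3 gives the triples used in Experiment 2 (for instance, "three diamonds alternate with three clubs"). The noncontingent check takes no array argument at all, which is the point of the contrast: the same card receives the same answer regardless of what has been played.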