This document summarizes Andrew Gelman's article about the usefulness of p-values. Gelman's examples fall into three categories: useful, unnecessary, and misleading. He provides examples from his own work where a p-value was useful in concluding that an election could have been fairly run, and where reporting a p-value in a redistricting study would have been fine but unnecessary. Gelman also discusses a study by Daryl Bem that interpreted p-values as support for precognition, when further analysis showed the data did not actually support that hypothesis.
Discussion of P-value
The Good, the Bad, and the Misleading
Qi Zhou, Steven Gregory
Yanlin Ma, Xing Zoey Zong

ABSTRACT: Andrew Gelman’s journal article, “P Values and Statistical Practice” [1], chiefly responds to claims put forth in the article “Living with P values: Resurrecting a Bayesian perspective on Frequentist Statistics” by Sander Greenland and Charles Poole [2]. Their article deals with the relation of p values to the Bayesian principles of prior and posterior distributions. Because we have not yet studied topics in Bayesian statistics, we focus our analysis on Gelman’s experiences with p values and his classification of their usefulness. In setting up his argument regarding the Bayesian ideas of Greenland and Poole, Gelman defines p values and gives examples of his own and others’ experience using p values to reach statistically significant conclusions. Gelman summarizes that p values are sometimes very useful in reaching conclusions, sometimes unnecessary, and sometimes misleading, drawing attention away from more meaningful conclusions that could be drawn. We then use separate examples we have seen to evaluate Gelman’s groupings. We also compare Gelman’s beliefs about p values to what we learned in STAT 341, and find that p values may not be as effective as we previously believed.

Gelman begins the body of his article by giving his definition of a p value and explaining some immediate problems with its use. He defines a p value as the probability that a test statistic would be greater than the value computed from the observed data, assuming that the null hypothesis is true. Thus, to reach statistical significance in rejecting the null hypothesis, the p value must be low, indicating that data like those observed would be unlikely if the null hypothesis were true. This definition and interpretation of p values is similar to what we learned in STAT 341. P values are then grouped into three categories: strong evidence, weak evidence, and no evidence. A p value less than .01 is strong evidence, and a p value between .01 and .1 is weak evidence. Any p value greater than .1 is not significant. Gelman finds an immediate problem with p values: they are hard to compare, because the difference between a significant result and a non-significant result is not necessarily itself significant. Thus, the p value is a statistic, and a measure of evidence, that has a lot of noise.
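
The comparison problem can be made concrete with a small sketch. The two estimates and standard errors below are illustrative assumptions chosen for this example, not figures taken from this discussion, and the helper name two_sided_p is hypothetical. Under a normal approximation, one study comes out significant and the other does not, yet the difference between them is far from significant.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Assumed (illustrative) estimates and standard errors of two independent studies.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a / se_a)   # study A on its own
p_b = two_sided_p(est_b / se_b)   # study B on its own

# For independent estimates, the variances add.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
p_diff = two_sided_p(diff / se_diff)

print(f"study A: p = {p_a:.3f}")
print(f"study B: p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

Study A falls below the .05 level and study B does not, but the test of their difference does not come close to significance, so the two p values alone do not show that the studies disagree.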

Gelman then discusses his experience using and reading about p values. He first describes being asked to determine whether a local election had been rigged, because it appeared that the number of votes for each candidate was increasing at a suspiciously constant rate [3]. Gelman used a chi-square test along with a test of the standard deviation of the results. The tests showed that it was certainly possible that voters arriving at the polls at random could have produced the pattern in which the votes were tallied. Gelman calculated a high p value and was able to say confidently that the null hypothesis of a fairly run election could not be rejected. This was a case where a p value worked.

Gelman then tells of his study of the effects of redistricting in state legislatures. In this case Gelman chose not to report a p value, but instead reported that the estimate was more than two standard errors from zero, which he states would have satisfied a .05 significance level. Gelman writes that using a p value would have been fine and effective, but unnecessary.
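
The link between “two standard errors from zero” and the .05 level can be checked with a short calculation. This is a minimal sketch that assumes a normal approximation and a two-sided test; the function name is hypothetical.

```python
import math

def two_sided_p(z):
    """Two-sided p value for a z statistic under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# An estimate exactly two standard errors from zero has z = 2.
p_at_two_se = two_sided_p(2.0)

print(f"p at 2 standard errors: {p_at_two_se:.4f}")
print("below .05:", p_at_two_se < 0.05)
```

Any estimate more than two standard errors from zero therefore gives a two-sided p value below .05 under these assumptions, which is why reporting the standard errors already conveys the same information.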

Finally, Gelman tells of a study by Daryl J. Bem [4] that incorrectly interpreted p values. Bem’s study claims that there is evidence that humans may have the ability of precognition, or knowing the future. Gelman asserts that a researcher who tries hard enough can find statistical significance in any experiment. Gelman suggests that Bem used only parts of his data, so that the data would support his conclusion. Another criticism of the Bem study, by Eric-Jan Wagenmakers et al. [5], claims that “the Bayesian t-test indicates that the data of Bem (2011) do not support the hypothesis of precognition.” The Wagenmakers article states that Bem’s study did not explore its own data enough, and that more refined statistical methodology actually supports rejecting the claim that precognition is possible. P values can lead to unsatisfactory or even wrong conclusions if they are not handled in the correct manner.

Now we evaluate Gelman’s analysis of p values by looking at separate examples, and compare his ideas to those we learned in STAT 341. In lecture, Professor Guttorp cited [6] a study by Gluckson and Leone on whether the supposed Sports Illustrated cover jinx exists. The theory behind the jinx states that an athlete’s performance diminishes after appearing on the cover of the magazine. If p represents the proportion of athletes whose performance diminished, then the null hypothesis is p = .5 and the alternative is p > .5. The study found that the performance of 114 out of 271 sampled athletes decreased after appearing on the cover. The p value in this case is the probability that, in a sample of 271 athletes, at least 114 (a sample proportion of .421) would show decreased performance, assuming that p = .5 is true. This p value is .996, which is clearly not significant: the data are certainly not in line with the alternative hypothesis that athlete performance declines more than half of the time.
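
The p value for the cover jinx data can be reproduced with a short calculation. The sketch below assumes an exact one-sided binomial test of p = .5 against p > .5, which may differ from the method used in the lecture notes (a normal approximation gives nearly the same answer). The function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))

n = 271          # athletes sampled
declined = 114   # athletes whose performance declined

p_hat = declined / n
p_value = binomial_upper_tail(declined, n, 0.5)

print(f"sample proportion: {p_hat:.3f}")
print(f"one-sided p value: {p_value:.3f}")
```

The result is close to the .996 quoted above. A p value this near 1 arises because the observed proportion lies well below .5, on the opposite side from the alternative hypothesis.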

Earlier in Professor Guttorp’s lecture notes [6], he had approached this hypothesis testing question using confidence intervals. He found that a 95% confidence interval for the true proportion of athletes whose performance declined, based on Gluckson and Leone’s data, was (.36, .48). This confidence interval includes all values within about two standard errors of the observed proportion of .421. Because the entire interval lies below .5, we were able to clearly rule out the alternative hypothesis that athlete performance declined most of the time, and could even have rejected the null hypothesis that performance declined half of the time. Using this method leads to the same definitive conclusion about the alternative hypothesis as the p value approach did. This observation is in line with Gelman’s thinking. Gelman’s belief that a p value can sometimes be effective, but is not usually necessary, is similar to the thinking we used in STAT 341 when deciding between null and alternative hypotheses.
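
The interval quoted above can be recovered from the same counts. This is a minimal sketch assuming the standard normal-approximation (Wald) interval with a multiplier of 1.96; the lecture notes may have used a slightly different construction.

```python
import math

n = 271          # athletes sampled
declined = 114   # athletes whose performance declined

p_hat = declined / n
se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error of p_hat
z = 1.96                                 # multiplier for a 95% interval

lower = p_hat - z * se
upper = p_hat + z * se

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print("interval excludes .5:", upper < 0.5)
```

Rounded to two decimals this gives (.36, .48), and the upper limit falls below .5, which matches the reasoning in the paragraph above.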

In the case of what Gelman describes as misleading p values, our learning experience differs somewhat from Gelman’s views. In STAT 341, we mostly assumed that the data we were presented with were legitimate, and that any conclusion we reached by rejecting a null hypothesis would be proof of an actual effect. Gelman’s human precognition example, as well as some of our own experiences, shows that this is not always the case. For instance, as in the Bem study, parts of the recorded data can sometimes be ignored so that a statistically significant conclusion can be reached. If data that support the conclusion a researcher wants to find are hand-picked over less conclusive data, a misleading p value can be used to show significance when there is none.

For example, suppose a person wants to test whether the Washington state government has a high approval rating rather than a low one, and collects sample data by distributing and collecting questionnaires. After analysis, he gets a highly significant result using only the data from questionnaires he sent to large companies, and concludes that people in Washington State assign a high rating to the state government. The problem here is that he focused only on people in companies and ignored all of the other citizens who have an opinion on the government. The analyst’s conclusion is incorrect because he ignored portions of his data that would have led to a non-significant result.
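
A small numerical sketch shows how this kind of selection can manufacture significance. All counts below are hypothetical, invented only for illustration, and the test is a one-sided normal-approximation test of whether the approval proportion exceeds .5.

```python
import math

def upper_tail_p(successes, n, p0=0.5):
    """One-sided p value (H1: p > p0) using a normal approximation."""
    p_hat = successes / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se0
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical questionnaire results (approve, total respondents).
company_approve, company_n = 70, 100
other_approve, other_n = 180, 400

# Using only the hand-picked subset of company respondents.
p_subset = upper_tail_p(company_approve, company_n)

# Using every returned questionnaire.
p_all = upper_tail_p(company_approve + other_approve, company_n + other_n)

print(f"company respondents only: p = {p_subset:.5f}")
print(f"all respondents:          p = {p_all:.2f}")
```

With these made-up counts the subset gives a very small p value, while the full data set gives an approval proportion of exactly .5 and no evidence at all of a high rating.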

We also found that sample size can make insignificant differences appear significant. Refer to Figure 1 in an article by Patrick Runkel [7]. In both Examples 1 and 2 the means, the difference between them, and the standard deviations are similar, but the sample sizes and the p values differ greatly. When sample sizes are large, p values can detect very small differences. So what is actually a very small change can be shown to be highly significant by a small p value. When the sample size is large enough, even a trivially small difference can be found to be statistically significant.
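
The effect of sample size can be sketched with a two-sample z test. The means, standard deviation, and sample sizes below are hypothetical and are not the values from Runkel’s figure; the function name is also an assumption. The only thing that changes between the two calls is the number of observations per group.

```python
import math

def two_sample_z_p(mean_a, mean_b, sd, n):
    """Two-sided p value comparing two group means with a z test.

    Assumes both groups have n observations and a common known
    standard deviation sd.
    """
    se = sd * math.sqrt(2.0 / n)   # standard error of the difference in means
    z = (mean_a - mean_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same means and spread in both cases; only the sample size differs.
p_small = two_sample_z_p(101.0, 100.0, sd=5.0, n=20)
p_large = two_sample_z_p(101.0, 100.0, sd=5.0, n=2000)

print(f"n = 20 per group:   p = {p_small:.3f}")
print(f"n = 2000 per group: p = {p_large:.2e}")
```

The one-unit difference is nowhere near significant with 20 observations per group, but it becomes overwhelmingly significant with 2000, even though the size of the difference has not changed.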

Another type of misleading p value comes about when data are not representative of the population they come from. The cheating test we did in class is a good example of this. Because the result reflects only students in our class, which has a different makeup of students than UW as a whole, we cannot use it to generalize to all of UW. Therefore, a p value calculated from our class data does not provide the whole picture, and we should not conclude anything about the university as a whole.

Gelman’s assertions that p values are not always as conclusive as they seem run counter to what we learned in STAT 341, and they led us to find many different reasons why this can be the case.

The main point we have taken away from the frequentist portion of Gelman’s article is that p values can be grouped into three categories: good, unnecessary, and misleading. We find that in the case of good and unnecessary p values, what we have learned is consistent with Gelman’s beliefs. But in the case of misleading p values, we find that there are many factors we had not yet considered which can make using p values an imperfect way of reasoning.

References
1. Gelman, Andrew. “P Values and Statistical Practice.” Epidemiology 24 (2013): 69-72.
2. Greenland, Sander, and Charles Poole. “Living with P-values: Resurrecting a Bayesian Perspective on Frequentist Statistics.” Epidemiology 24 (2013): 62-68.
3. Gelman, Andrew. “55,000 residents desperately need your help!” Chance 17 (2004): 28-31.
4. Bem, Daryl. “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Journal of Personality and Social Psychology (2010).
5. Wagenmakers, E., R. Wetzels, D. Borsboom, and H. van der Maas. “Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011).” Journal of Personality and Social Psychology 100 (2011): 426-432.
6. “Testing.” Last updated March 5, 2014. http://www.stat.washington.edu/peter/341/Testing.pdf
7. Runkel, Patrick. “Large Samples: Too Much of a Good Thing?” The Minitab Blog, June 4, 2012. http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing