Discussion of P-value

The Good, the Bad, and the Misleading

Qi Zhou, Steven Gregory, Yanlin Ma, Xing Zoey Zong

ABSTRACT:

Andrew Gelman’s journal article, “P Values and Statistical Practice” [1], chiefly responds to claims put forth in the article “Living with P values: Resurrecting a Bayesian perspective on Frequentist Statistics” by Sander Greenland and Charles Poole [2]. Their article deals with the relation of p values to the Bayesian principles of prior and posterior distributions. Because we have not yet studied Bayesian statistics, we focus our analysis on Gelman’s experiences with p values and his classification of their usefulness. In setting up his argument regarding the Bayesian ideas of Greenland and Poole, Gelman defines p values and gives examples of his and others’ experience using p values to reach statistically significant conclusions. Gelman summarizes that p values are sometimes very useful in reaching conclusions, sometimes unnecessary, and sometimes misleading, drawing attention away from more meaningful conclusions that could be drawn. We then use separate examples we have seen to evaluate Gelman’s grouping of p values by effectiveness. We also compare Gelman’s beliefs about p values to what we learned in STAT 341, and find that p values may not be as effective as we previously believed.

Gelman begins the body of his article by giving his definition of a p value and explaining some immediate problems with its use. He defines a p value as the probability, assuming that the null hypothesis is true, of obtaining a value of the test statistic greater than the one computed from the observed data. Thus, to reject the null hypothesis with statistical significance, the p value must be low, showing that the observed data would be unlikely if the null hypothesis were true. This definition and interpretation of p values is similar to what we learned in STAT 341. P values are then grouped into three categories: strong evidence, weak evidence, and no evidence. A p value less than .01 is strong evidence, a p value between .01 and .1 is weak evidence, and a p value greater than .1 is not significant. Gelman finds an immediate problem with p values in that they are hard to compare: the difference between a significant result and a non-significant one is not necessarily statistically significant itself. Thus, the p value is a statistic, and as a measure of evidence it carries a lot of noise.
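
To make the point about comparing p values concrete, here is a minimal sketch using a normal approximation. The two estimates (25 and 10, each with standard error 10) are hypothetical numbers assumed for illustration and are not taken from the articles we discuss. One estimate is significant at the .05 level and the other is not, yet the difference between them is far from significant.

```python
import math

def two_sided_p(estimate, std_error):
    """Two-sided p value for H0: true value = 0, using a normal approximation."""
    z = abs(estimate / std_error)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))

# Hypothetical estimates, assumed independent, each with standard error 10.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)

# The standard error of a difference of independent estimates is the
# square root of the sum of the squared standard errors.
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"study A:   p = {p_a:.3f}")
print(f"study B:   p = {p_b:.3f}")
print(f"A minus B: p = {p_diff:.3f}")
```

Under these assumed numbers, study A would be reported as significant and study B as not, even though a direct test gives little evidence that the two studies differ.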

Gelman then discusses his experience using and reading about p values. He first describes his experience determining whether a local election had been rigged, because it appeared as if the number of votes for each candidate was increasing at a suspiciously constant rate [3]. Gelman used a chi-square test based on the standard deviation of the results. The test showed that voters arriving at the polls at random could certainly have produced the pattern in which the votes were tallied. Gelman calculated a high p value and was able to say confidently that the null hypothesis of a fairly run election could not be rejected. This was a case where a p value worked. Gelman then tells of his study of the effects of redistricting in state legislatures. In this case Gelman chose not to report a p value, but instead reported that the estimate was more than two standard errors from zero, which he states would have satisfied a .05 significance level. Gelman writes that using a p value would have been fine and effective, but unnecessary.

Finally, Gelman tells of a study by Daryl J. Bem [4] that incorrectly interpreted p values. Bem’s study claims that there is evidence that humans may have the ability of precognition, or knowing the future. Gelman asserts that a researcher who tries hard enough can find statistical significance in any experiment. Gelman suggests that Bem used only parts of his data, so that the data would support his conclusion. Another criticism of the Bem study, by Eric-Jan Wagenmakers et al. [5], claims that “the Bayesian t-test indicates that the data of Bem (2011) do not support the hypothesis of precognition.” The Wagenmakers article states that Bem’s study did not explore its own data enough, and that more refined statistical methodology actually supports rejecting the claim that precognition is possible. P values can lead to unsatisfactory or even wrong conclusions if they are not handled correctly.

We now evaluate Gelman’s analysis of p values by looking at separate examples, and compare his ideas to those we learned in STAT 341. In lecture, Professor Guttorp cited [6] a study by Gluckson and Leone on whether the supposed Sports Illustrated cover jinx exists. The theory behind the jinx states that an athlete’s performance diminishes after the athlete appears on the cover of the magazine. If p represents the proportion of athletes whose performance diminished, then we test a null hypothesis of p = .5 against an alternative of p > .5. The study found that the performance of 114 out of 271 sampled athletes (a sample proportion of .421) decreased after they appeared on the cover. The p value in this case is the probability that, in a sample of 271 athletes, at least 114 would show decreased performance, assuming that p = .5 is true. This p value is .996, which is clearly not significant: the data are not at all in line with the alternative hypothesis that athlete performance declines more than half of the time.
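
The p value for the cover jinx data can be checked with a short calculation. The sketch below assumes that the number of athletes whose performance declined follows a binomial distribution with 271 trials, and computes the exact one-sided tail probability; the function name `binomial_upper_tail` is our own and the lecture notes may have used a normal approximation instead.

```python
from fractions import Fraction
from math import comb

def binomial_upper_tail(successes, trials, prob):
    """Exact P(X >= successes) for X ~ Binomial(trials, prob)."""
    prob = Fraction(prob)
    total = Fraction(0)
    for k in range(successes, trials + 1):
        total += comb(trials, k) * prob**k * (1 - prob)**(trials - k)
    return float(total)

# Counts from the Sports Illustrated example: 114 of 271 athletes declined.
declined, sampled = 114, 271

# One-sided test of H0: p = .5 against the alternative p > .5.
p_value = binomial_upper_tail(declined, sampled, Fraction(1, 2))

print(f"sample proportion = {declined / sampled:.3f}")
print(f"p value = {p_value:.3f}")
```

The result should come out close to the .996 reported above. A p value this near 1 says that, under the null hypothesis, almost every sample would show at least as many declines as the one observed.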

Earlier in Professor Guttorp’s lecture notes [6], he had addressed this hypothesis testing question using confidence intervals. He had found that a 95% confidence interval for the true proportion of athletes whose performance declined, based on Gluckson and Leone’s data, was (.36, .48). This confidence interval includes all values within about two standard errors of the observed proportion of .421. Because the entire interval lies below .5, we were able to clearly rule out the alternative hypothesis that athlete performance declined most of the time, and could even have rejected the null hypothesis that athlete performance declined half of the time. Clearly this method brings us to a definitive rejection of the alternative hypothesis, just as the p value approach did. This observation is in line with Gelman’s thinking. Gelman’s belief that a p value can sometimes be effective, but is usually not necessary, is similar to the thinking we used in STAT 341 in rejecting or accepting alternative hypotheses.
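
The interval can be reproduced from the same counts. This is a minimal sketch that assumes the usual normal-approximation interval for a proportion (estimate plus or minus 1.96 standard errors); we have not confirmed that the lecture notes used exactly this formula.

```python
import math

# Counts from the Sports Illustrated example: 114 of 271 athletes declined.
declined, sampled = 114, 271
p_hat = declined / sampled

# Estimated standard error of a sample proportion.
std_error = math.sqrt(p_hat * (1 - p_hat) / sampled)

# 1.96 is the usual normal multiplier for 95% coverage.
z_95 = 1.96
lower = p_hat - z_95 * std_error
upper = p_hat + z_95 * std_error

print(f"95% interval: ({lower:.2f}, {upper:.2f})")
print("0.5 inside interval:", lower <= 0.5 <= upper)
```

Under this assumption the endpoints round to (.36, .48), and .5 falls outside the interval, which is why the null hypothesis could also have been rejected in a two-sided test.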

In the case of what Gelman describes as misleading p values, our learning experience differs somewhat from Gelman’s views. In STAT 341, we mostly assumed that the data we were presented with were legitimate, and that any conclusion we reached by rejecting a null hypothesis would be proof of an actual effect. Gelman’s human precognition example, as well as some of our own experiences, shows that this is not always the case.

For instance, as in the Bem study, parts of the recorded data can sometimes be ignored so that a statistically significant conclusion can be reached. If data that support the conclusion a researcher wants to find are hand-picked over less conclusive data, a misleading p value can be used to show significance when there is none. For example, suppose a person who wants to test whether the Washington State government has a low or a high approval rating collects sample data by distributing questionnaires and collecting the responses. After analysis, he gets a highly significant result by using only the data from questionnaires he sent to large companies, and concludes that people in Washington State assign a high rating to the state government. The problem here is that he focused only on people in companies and ignored all of the other citizens who have an opinion on the government. The analyst’s conclusion is incorrect because he ignored portions of his data that would have given him an insignificant result.

We also found that sample size can make insignificant conclusions significant. Refer to Figure 1 from an article by Patrick Runkel [7]. In both Examples 1 and 2, the means, the difference between them, and the standard deviations are similar, but the sample sizes and the p values differ greatly. When sample sizes are large, p values can detect very small differences. So what is actually a very small change can be shown to be highly significant by a small p value. When a sample is large enough, almost any difference, however small, can be found to be statistically significant.
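
The effect of sample size can be sketched with summary statistics alone. The numbers below (means of 100.0 and 100.5, a common standard deviation of 5, and group sizes of 50 and 5000) are hypothetical values chosen for illustration and are not the values in Runkel’s figure. The helper `two_sample_p` is our own and uses a normal approximation to the two-sample test statistic.

```python
import math

def two_sample_p(mean_1, mean_2, sd, n):
    """Two-sided p value for H0: equal means, two groups of size n, common sd.

    Uses a normal approximation to the test statistic.
    """
    std_error = math.sqrt(sd**2 / n + sd**2 / n)
    z = abs(mean_1 - mean_2) / std_error
    return math.erfc(z / math.sqrt(2))

# Hypothetical summary statistics: identical means and sd in both scenarios.
mean_1, mean_2, sd = 100.0, 100.5, 5.0

p_small = two_sample_p(mean_1, mean_2, sd, n=50)     # 50 per group
p_large = two_sample_p(mean_1, mean_2, sd, n=5000)   # 5000 per group

print(f"n = 50 per group:   p = {p_small:.3f}")
print(f"n = 5000 per group: p = {p_large:.2e}")
```

The difference in means is the same half point in both cases. Only the sample size changes, yet the first p value is far from significant while the second is very small.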

Another type of misleading p value comes about when data are not representative of the population they come from. The cheating test we did in class is a good example of this. Because the result reflects only the students in our class, which has a different makeup of students than UW as a whole, we cannot use it to generalize to all of UW. Therefore, a p value calculated from our class data does not provide the whole picture, and we should not conclude anything about the university as a whole. Gelman’s assertion that p values are not always as conclusive as they seem runs counter to what we learned in STAT 341, and it led us to find many different reasons why this can be the case.

The main point we have taken away from the frequentist portion of Gelman’s article is that p values can be grouped into three categories: good, unnecessary, and misleading. We find that in the case of good and unnecessary p values, what we have learned is consistent with Gelman’s beliefs. But in the case of misleading p values, we find that there are many factors we had not yet considered that can make p values an imperfect way of reasoning.

References

1. Gelman, Andrew. “P Values and Statistical Practice.” Epidemiology 24 (2013): 69-72.

2. Greenland, Sander, and Charles Poole. “Living with P-values: Resurrecting a Bayesian Perspective on Frequentist Statistics.” Epidemiology 24 (2013): 62-68.

3. Gelman, Andrew. “55,000 Residents Desperately Need Your Help!” Chance 17 (2004): 28-31.

4. Bem, Daryl. “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Journal of Personality and Social Psychology (2010).

5. Wagenmakers E, Wetzels R, Borsboom D, van der Maas H. “Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011).” Journal of Personality and Social Psychology 100 (2011): 426-432.

6. “Testing.” Last updated March 5, 2014. http://www.stat.washington.edu/peter/341/Testing.pdf

7. Runkel, Patrick. “Large Samples: Too Much of a Good Thing?” The Minitab Blog, June 4, 2012. http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing
