The Good, the Bad, and the Misleading
  
Qi Zhou, Steven Gregory, Yanlin Ma, Xing Zoey Zong
  	
  
	
  
ABSTRACT:	
  
Andrew Gelman’s journal article, “P Values and Statistical Practice,”[1] chiefly responds to claims put forth in the article “Living with P Values: Resurrecting a Bayesian Perspective on Frequentist Statistics” by Sander Greenland and Charles Poole.[2] That article deals with the relation of p values to the Bayesian principles of prior and posterior distributions. Because we have not yet studied Bayesian statistics, we focus our analysis on Gelman’s experiences with p values and his classification of their usefulness. In setting up his argument regarding the Bayesian ideas of Greenland and Poole, Gelman defines p values and gives examples of his and others’ experience using p values to reach statistically significant conclusions. Gelman summarizes that p values are sometimes very useful in reaching conclusions, at other times unnecessary, and at still other times misleading, drawing attention away from more meaningful conclusions that could be drawn. We then use separate examples we have seen to evaluate Gelman’s groupings. We also compare Gelman’s beliefs about p values with what we learned in STAT 341, and find that p values may not be as effective as we previously believed.
   	
  
	
  
Gelman begins the body of his article by defining a p value and explaining some immediate problems with its use. He defines a p value as the probability, assuming that the null hypothesis is true, that a test statistic computed from replicated data would exceed the one computed from the observed data. Thus, to reject the null hypothesis with statistical significance, the p value must be low, indicating that data like those observed would be unlikely if the null hypothesis were true. This definition and interpretation of p values is similar to what we learned in STAT 341. P values are then grouped into three categories: strong evidence, weak evidence, and no evidence. A p value less than .01 is strong evidence, and one between .01 and .1 is weak evidence. Any p value greater than .1 is not significant. Gelman finds an immediate problem with p values in that they are hard to compare: the difference between a significant result and a non-significant one is not itself necessarily statistically significant. Thus, the p value is itself a statistic, and a noisy measure of evidence.
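The comparison problem can be made concrete with a short calculation. The Python sketch below uses illustrative numbers that we chose for this purpose (two independent estimates of 25 and 10, each with a standard error of 10); they are assumptions, not data from STAT 341. It assumes normally distributed estimates and a two-sided test.

```python
import math

def two_sided_p(z):
    """Two-sided normal p value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Illustrative estimates and standard errors (assumed values).
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a / se_a)   # 2.5 standard errors from zero
p_b = two_sided_p(est_b / se_b)   # 1.0 standard error from zero

# Test the difference between the two independent estimates directly.
diff = est_a - est_b
se_diff = math.sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff / se_diff)

print(f"p_a = {p_a:.3f}, p_b = {p_b:.3f}, p_diff = {p_diff:.3f}")
```

Under these assumptions the first estimate is significant at the .05 level and the second is not, yet the test of their difference is far from significant, so the two p values alone do not show that the two results really differ.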
  
	
Gelman then discusses his experience using and reading about p values. He first describes determining whether a local election had been rigged, because the number of votes for each candidate appeared to be increasing at a suspiciously constant rate.[3] Gelman used a chi-square test of the standard deviation of the results. The test showed that it was entirely possible for voters arriving at the polls at random to have produced the pattern in which the votes were tallied. Gelman calculated a high p value and could confidently say that the null hypothesis of a fairly run election could not be rejected. This was a case where a p value worked. Gelman then describes his study of the effects of redistricting in state legislatures. In this case Gelman chose not to report a p value, but instead reported that the estimate was more than two standard errors from zero, which he states would have satisfied a .05 significance level. Gelman writes that using a p value would have been fine and effective, but unnecessary.
  	
  
Finally, Gelman describes a study by Daryl J. Bem[4] that interpreted p values incorrectly. Bem’s study claims that there is evidence that humans may have the ability of precognition, or knowing the future. Gelman asserts that a researcher who tries hard enough can find statistical significance in any experiment. Gelman suggests that Bem used only parts of his data, so that the data would support his conclusion. Another criticism of the Bem study, by Eric-Jan Wagenmakers et al.,[5] claims that “the Bayesian t-test indicates that the data of Bem (2011) do not support the hypothesis of precognition.” The Wagenmakers article states that Bem’s study did not explore its own data enough, and that more refined statistical methodology actually supports rejecting the claim that precognition is possible. P values can lead to unsatisfactory or even wrong conclusions if they are not handled correctly.
  	
  
	
Now we evaluate Gelman’s analysis of p values by looking at separate examples and comparing his ideas with those we learned in STAT 341. In lecture, Professor Guttorp cited[6] a study by Gluckson and Leone on whether the supposed Sports Illustrated cover jinx exists. The theory behind the jinx is that an athlete’s performance diminishes after the athlete appears on the cover of the magazine. If p represents the proportion of athletes whose performance diminished, then the null hypothesis is p = .5 and the alternative is p > .5. The study found that the performance of 114 out of 271 sampled athletes decreased after they appeared on the cover, a sample proportion of .421. The p value in this case is the probability that, in a sample of 271 athletes, 114 or more would show decreased performance, assuming that p = .5 is true. This p value is .996, which is clearly not significant: the data are certainly not in line with the alternative hypothesis that athlete performance declines more than half of the time.
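The p value for the cover jinx data can be checked with a few lines of Python. This is a minimal sketch using only the counts quoted above (114 of 271) and the null value p = .5. It computes both an exact binomial tail probability and a normal approximation without continuity correction; which of the two the lecture notes used is not something we can confirm, so both are shown.

```python
import math

n, x, p0 = 271, 114, 0.5

# Exact one-sided p value: P(X >= 114) when X ~ Binomial(271, 0.5).
# With p0 = 0.5 every outcome has probability 1 / 2**n.
p_exact = sum(math.comb(n, k) for k in range(x, n + 1)) / 2**n

# Normal approximation (no continuity correction).
p_hat = x / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_normal = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

print(f"sample proportion = {p_hat:.3f}, z = {z:.2f}")
print(f"exact p value = {p_exact:.4f}, normal approximation = {p_normal:.4f}")
```

Both calculations give a p value very close to 1, in agreement with the .996 reported above.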
  	
  
	
Earlier in Professor Guttorp’s lecture notes,[6] he had addressed this hypothesis testing question using confidence intervals. He found that a 95% confidence interval for the true proportion of athletes whose performance declined, based on Gluckson and Leone’s data, was (.36, .48). This confidence interval includes all values within about two standard errors of the observed proportion of .421. Because the entire interval lies below .5, the data give no support to the alternative hypothesis that athlete performance declined most often, and with a two-sided test we could even have rejected the null hypothesis that athlete performance declined half of the time. Clearly, this method leads to the same conclusion about the alternative hypothesis as the p value approach did. This observation is in line with Gelman’s thinking. Gelman’s belief that a p value can sometimes be effective, but is not usually necessary, is similar to the thinking we used in STAT 341 in assessing hypotheses.
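The interval can be reproduced in Python from the same counts. The sketch below assumes the standard large-sample (Wald) interval, the sample proportion plus or minus 1.96 standard errors; the lecture notes may have used a slightly different formula.

```python
import math

n, x = 271, 114
p_hat = x / n

# Standard error of the sample proportion.
se = math.sqrt(p_hat * (1 - p_hat) / n)

# 95% interval: roughly two standard errors on either side.
z_crit = 1.96
lower = p_hat - z_crit * se
upper = p_hat + z_crit * se

print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print("interval lies entirely below .5:", upper < 0.5)
```

Under that assumption the interval matches the (.36, .48) quoted above and excludes .5, which is why no separate p value is needed to reach the same conclusion.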
  	
  
	
In the case of what Gelman describes as misleading p values, our learning experience differs somewhat from Gelman’s views. In STAT 341, we mostly assumed that the data we were presented with were legitimate, and that any conclusion we reached by rejecting a null hypothesis would be proof of an actual effect. Gelman’s human precognition example, as well as some of our own experiences, shows that this is not always the case.
  	
  
For instance, as in the Bem study, parts of the recorded data can sometimes be ignored so that a statistically significant conclusion can be reached. If data that support the conclusion a researcher wants to find are hand-picked over less conclusive data, a misleading p value can be used to show significance when there is none. For example, suppose a person who wants to test whether the Washington state government has a low or a high approval rating collects sample data by distributing and collecting questionnaires. After analysis, he gets a highly significant result using only the data from questionnaires he sent to large companies, and concludes that people in Washington State assign a high rating to the state government. The problem here is that he focused only on people in companies and ignored all of the other citizens who have an opinion on the government. The analyst’s conclusion is incorrect because he ignored portions of his data that would have given him an insignificant result.
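A small simulation shows how easily hand-picking data produces a significant result when there is no effect at all. The Python sketch below is hypothetical in every detail: the 20 subgroups, the 30 observations per subgroup, the known standard deviation of 1, and the seed are all our assumptions. Each simulated study has no real effect, but the analyst reports only the subgroup with the smallest p value.

```python
import math
import random

def two_sided_p(z):
    """Two-sided normal p value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def smallest_subgroup_p(rng, n_subgroups=20, n_per_group=30):
    """Smallest p value across subgroups when the true effect is zero."""
    p_values = []
    for _ in range(n_subgroups):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        mean = sum(sample) / n_per_group
        z = mean * math.sqrt(n_per_group)  # z test: null mean 0, known sd 1
        p_values.append(two_sided_p(z))
    return min(p_values)

rng = random.Random(341)  # fixed seed so the run is repeatable
n_sims = 1000
hits = sum(smallest_subgroup_p(rng) < 0.05 for _ in range(n_sims))
false_positive_rate = hits / n_sims

print(f"share of null studies with a 'significant' subgroup: {false_positive_rate:.2f}")
```

In theory the chance that at least one of 20 independent null subgroups falls below .05 is 1 - .95^20, or about .64, so most of these studies with no effect can still report a significant finding if only the best subgroup is shown.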
  
	
We also found that sample size can make insignificant conclusions significant. Refer to Figure 1 in an article by Patrick Runkel.[7] In both Examples 1 and 2, the means, the difference between them, and the standard deviations are similar, but the sample sizes and the p values differ greatly. When sample sizes are large, p values can detect very small differences. So what is actually a very small change can be shown to be highly significant by a small p value. When a sample size is large enough, almost any nonzero difference, however small, can be found to be statistically significant.
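The effect of sample size can be sketched in Python with made-up numbers. The means, standard deviation, and sample sizes below are hypothetical and are not the values in Runkel’s figure; the sketch also assumes a two-sample z test with equal group sizes rather than the t test a statistics package would normally report.

```python
import math

def two_sample_z_p(mean1, mean2, sd, n):
    """Two-sided p value for a difference in means (equal sd, n per group)."""
    se = sd * math.sqrt(2.0 / n)
    z = (mean1 - mean2) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same means and standard deviation in both scenarios (hypothetical values).
mean1, mean2, sd = 100.0, 101.0, 10.0

p_small = two_sample_z_p(mean1, mean2, sd, n=20)      # 20 per group
p_large = two_sample_z_p(mean1, mean2, sd, n=20000)   # 20,000 per group

print(f"n = 20 per group:     p = {p_small:.3f}")
print(f"n = 20000 per group:  p = {p_large:.3g}")
```

The one-point difference is a tenth of a standard deviation in both cases. Only the sample size changes, yet the first p value is far from significant and the second is vanishingly small.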
  
	
Another type of misleading p value comes about when data are not representative of the population they come from. The cheating test we did in class is a good example of this. Because the result reflects only students in our class, whose makeup differs from that of UW as a whole, we cannot use it to generalize to all of UW. Therefore, a p value calculated from our class data does not provide the whole picture, and we should not conclude anything about the university as a whole. Gelman’s assertion that p values are not always as conclusive as they seem runs counter to what we learned in STAT 341, and it led us to find many different reasons why this can be the case.
  
The main point we have taken away from the frequentist portion of Gelman’s article is that p values can be grouped into three categories: good, unnecessary, and misleading. We find that in the case of good and unnecessary p values, what we have learned is consistent with Gelman’s beliefs. But in the case of misleading p values, we find that there are many factors we had not yet considered that can make p values an imperfect way of reasoning.
  
	
  
	
  
	
  
References

1. Gelman, Andrew. “P Values and Statistical Practice.” Epidemiology 24 (2013): 69–72.
2. Greenland, Sander, and Charles Poole. “Living with P-values: resurrecting a Bayesian perspective on frequentist statistics.” Epidemiology 24 (2013): 62–68.
3. Gelman, Andrew. “55,000 residents desperately need your help!” Chance 17 (2004): 28–31.
4. Bem, Daryl. “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Journal of Personality and Social Psychology 100 (2011): 407–425.
5. Wagenmakers E, Wetzels R, Borsboom D, van der Maas H. “Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011).” Journal of Personality and Social Psychology 100 (2011): 426–432.
6. “Testing.” Last updated March 5, 2014. http://www.stat.washington.edu/peter/341/Testing.pdf
7. Runkel, Patrick. “Large Samples: Too Much of a Good Thing?” The Minitab Blog, June 4, 2012. http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing
	
  

Discussion of P-value

  • 1. The  Good,  the  Bad,  and  the  Misleading   Qi  Zhou,     Steven  Gregory   Yanlin  Ma,   Xing  Zoey  Zong       ABSTRACT:   Andrew  Gelman’s  journal  article,  “P  Values  and  Statistical  Practice”1  chiefly  looks  to   respond  to  claims  put  forth  in  an  article,  “Living  with  P  values:  Resurrecting  a  Bayesian   perspective  on  Frequentist  Statistics”  by  Sander  Greenland  and  Charles  Poole2 .  This  article  deals   with  the  relation  of  p  values  to  Bayesian  principles  of  prior  and  posterior  distributions.  Because   we  have  not  yet  studied  topics  in  Bayesian  statistics  we  will  focus  our  analysis  on  Gelman’s   experiences  with  p  values  and  his  classifications  of  the  usefulness  of  them.  In  setting  up  his   argument  regarding  the  Bayesian  ideas  of  Greenland  and  Poole,  Gelman  defines  p  values  and   gives  examples  of  his  and  others’  experience  using  p  values  to  come  to  statistically  significant   conclusions.  Gelman  summarizes  that  sometimes  p  values  are  very  useful  in  coming  to   conclusions,  other  times  they  are  unnecessary,  and  while  still  other  times  they  can  mislead   from  more  significant  conclusions  that  can  be  drawn.  We  will  then  use  separate  examples  we   have  seen  to  evaluate  Gelman’s  groupings  of  the  effectiveness.  We  also  compared  Gelman’s   beliefs  about  p  values  to  what  we  learned  in  STAT  341,  and  found  that  p  values  may  not  be  as   effective  as  we  previously  believed.      
  • 2. Gelman  begins  the  body  of  his  article  by  giving  his  definition  of  a  p  value  and  explaining   some  immediate  problems  with  the  use  of  them.  He  defines  a  p  value  as  the  probability  that  a   value  is  greater  than  the  observed  data  assuming  that  the  null  hypothesis  is  true.  Thus,  to   secure  statistical  significance  in  rejecting  the  null  hypothesis,  the  p  value  must  be  low  to  show   that  the  data  does  not  come  from  the  null  hypothesis.  This  definition  and  interpretation  of  p   values  is  similar  to  what  we  learned  in  STAT  341.    P  values  are  then  grouped  into  three   categories:  strong  evidence,  weak  evidence,  and  no  evidence.  If  the  p  value  is  less  than  .01,  it  is   strong,  and  if  it  is  between  .01  and  .1  it  is  weak.  Any  p  value  greater  than  .1  is  not  significant.     Gelman  finds  an  immediate  problem  with  p  values  in  that  comparison  is  hard  between  p  values   because  the  differences  between  two  results  is  not  significant.  Thus,  the  p  value  is  a  statistic   and  a  measure  of  evidence  that  has  a  lot  of  noise.     Gelman  then  discusses  his  experience  using  and  reading  about  p  values.  He  first  tells   about  his  experience  determining  if  a  local  election  had  been  rigged  because  it  appeared  as  if   the  number  of  votes  for  each  candidate  was  increasing  at  a  suspiciously  constant  rate.3  Gelman   used  a  chi-­‐square  test  with  testing  the  standard  deviation  of  the  results.  The  results  of  the  test   showed  that  it  was  certainly  possible  that  voters  randomly  coming  to  the  polls  could  have   produced  the  pattern  in  which  the  votes  were  tallied.  Gelman  calculated  a  high  p  value  and  was   able  to  confidently  say  a  null  hypothesis  of  the  election  being  fairly  run  could  not  be  rejected.   
This  was  a  case  where  a  p  value  worked.  Gelman  then  tells  of  his  study  into  the  effects  of   redistricting  in  state  legislatures.  In  this  case  Gelman  chose  not  to  report  a  p  value,  but  instead   reported  that  the  data  was  more  than  two  standard  errors  from  zero  which  he  states  would  
  • 3. have  satisfied  a  .05  significance  level.  Gelman  writes  that  using  a  p  value  would  have  been  fine   and  effective,  but  unnecessary.     Finally,  Gelman  tells  of  a  study  by  Daryl  J.  Bem4  that  incorrectly  interpreted  p  values.   Bem’s  study  claims  that  there  is  evidence  that  humans  may  have  the  ability  for  precognition,  or   knowing  the  future.  Gelman  asserts  that  if  a  researcher  tries  hard  enough,  he  can  find  statistical   significance  in  any  experiment.  Gelman  suggests  that  Bem  only  used  parts  of  his  data,  so  that   the  data  would  support  his  conclusion.  Another  criticism  of  the  Bem  study  by  Eric-­‐Jan   Wagenmakers  et.  al5  claims  that  “the  Bayesian  t-­‐test  indicates  that  the  data  of  Bem  (2011)  do   not  support  the  hypothesis  of  precognition.”  The  Wagenmakers  article  states  that  Bem’s  study   did  not  explore  its  own  data  enough,  and  that  using  more  refined  statistical  methodology  will   actually  support  a  rejection  of  the  claim  that  precognition  is  possible.  P  values  can  be  used  to   create  unsatisfactory  or  even  wrong  conclusions  if  they  are  not  handled  in  the  correct  manner.       Now,  we  will  evaluate  Gelman’s  analysis  of  p  values  by  looking  at  separate  examples   and  compare  his  ideas  to  those  that  we  learned  in  STAT  341.  In  lecture,  Professor  Guttorp  cited6   a  study  by  Gluckson  and  Leone  that  dealt  with  whether  the  supposed  Sports  Illustrated  cover   jinx  existed.  The  theory  behind  the  jinx  stated  that  athlete  performance  diminished  after   appearing  on  the  cover  of  the  magazine.  If  p  represents  the  percentage  of  athletes  whose   performance  diminished,  then  a  null  hypothesis  of  p=.5  with  an  alternative  of  p>.5  is   established.  
The  study  found  that  114  out  of  271  sampled  athlete’s  performance  decreased   after  appearing  on  the  cover.  The  p  value  in  this  case  is  the  probability  that  in  the  total   population  of  athletes,  more  than  114  out  of  271  (p  =  .421)  will  have  decreased  performance   assuming  that  p=.5  is  true.  This  p  value  is  .996,  which  is  clearly  not  significant  and  is  evidence  
  • 4. that  the  data  is  certainly  not  in  line  with  the  alternative  hypothesis  that  athlete  performance   declines  more  than  half  of  the  time.       Earlier  in  Professor  Guttorp’s  lecture  notes6 ,  he  had  solved  this  hypothesis  testing   question  using  confidence  intervals.  He  had  found  that  a  95%  confidence  interval  for  the  true   proportion  of  athletes  whose  performance  declined  based  on  Gluckson  and  Leone’s  data  was   (.36,  .48).  This  confidence  interval  includes  all  values  about  two  standard  errors  away  from  the   observed  p  =  .421.  We  were  able  to  clearly  reject  the  alternative  hypothesis  that  athlete   performance  declined  most  often,  and  could  even  have  rejected  the  null  hypothesis  that   athlete  performance  declined  half  of  the  time.  Clearly  using  this  method  brings  us  to  a   definitive  rejection  of  the  alternative  hypothesis,  just  as  using  the  p  value  approach  did.  This   observation  is  in  line  with  Gelman’s  thinking.  Gelman’s  belief  that  a  p  value  can  sometimes  be   effective,  but  not  usually  be  necessary  is  similar  to  the  thinking  we  used  in  STAT  341  in  rejecting   or  accepting  alternative  hypotheses.       In  the  case  of  what  Gelman  describes  as  misleading  p  values,  our  learning  experience   differs  somewhat  to  Gelman’s  views.  In  STAT  341,  we  mostly  assumed  that  the  data  we  were   presented  was  legitimate,  and  any  conclusions  we  could  come  to  by  rejecting  a  null  hypothesis   would  be  proofs  of  an  actual  effect.  Gelman’s  human  precognition  example  as  well  as  some  of   our  own  experiences  show  that  this  is  not  always  the  case.     For  instance,  as  in  the  Bem  study,  sometimes  parts  of  recorded  data  can  be  ignored  so   that  a  statistically  significant  conclusion  can  be  reached.  
If data that support a conclusion that a researcher wants to find are hand-picked over less conclusive data, a misleading p value can be used to show significance when there is none. For example, suppose a person who wants to test a low approval rating against a high approval rating of the Washington State government collects sample data by distributing and collecting questionnaires. After analysis, he gets a highly significant result from using only data that come from questionnaires he sent to large companies, and concludes that people in Washington State assign a high rating to the state government. The problem here is that he focused only on people in companies and ignored all of the other citizens who have an opinion on the government. The conclusion the analyst comes to is incorrect because he ignored portions of his data that would have given him an insignificant result.

We also found that sample size can make insignificant conclusions significant. Refer to figure 1 from an article by Patrick Runkel7. In both Examples 1 and 2 the means, the difference between them, and the standard deviations are similar. But the sample sizes and the p values differ greatly. When sample sizes are large, p values can detect very small differences. So, what could actually be a very small change could be shown to be very significant by a small p value. When a sample size is large enough, almost any difference, however small, can be found to be statistically significant.

Another type of misleading p value comes about when data is not representative of the population we want to draw conclusions about. The cheating test we did in class is a good example of this. Because the result only reflects students in our class, which has a different makeup of students than UW as a whole, we cannot use it to generalize to the whole university. Therefore, a p value calculated from our class data does not provide the whole picture, and we should not conclude anything about the university as a whole. Gelman’s assertion that p values are not always as conclusive as they seem runs counter to what we learned in STAT 341, and it caused us to find many different reasons why this can be the case.
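The sample size effect described above can be illustrated with a short calculation. The following is only a minimal sketch under stated assumptions: it uses a two-sample z test (a normal approximation, not necessarily the test behind Runkel’s figure), it assumes both groups have the same size and the same standard deviation, and the numbers and the function name two_sample_z_p_value are hypothetical values chosen for illustration, not data from the article.

```python
import math


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p value for a difference in means (normal approximation).

    Hypothetical helper for illustration: assumes two independent groups
    with the same standard deviation and the same sample size.
    """
    # Standard error of the difference between two independent sample means.
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical values, NOT taken from Runkel's figure:
    # the observed difference and spread stay fixed, only n changes.
    mean_diff = 0.5
    sd = 5.0
    for n in (20, 200, 2000, 20000):
        p = two_sample_z_p_value(mean_diff, sd, n)
        print(f"n per group = {n:>6}: p = {p:.3g}")
```

With the difference and standard deviation held fixed, the p value shrinks as the sample size grows, so the same small difference moves from clearly insignificant to highly significant purely because more data were collected.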
The main point we have taken away from the frequentist portion of Gelman’s article is that p values can be grouped into three categories: good, unnecessary, and misleading. We find that in the case of good and unnecessary p values, what we have learned is consistent with Gelman’s beliefs. But in the case of misleading p values, we find that there are many factors that we had not yet considered which can make using p values an imperfect way of reasoning.

References

1. Gelman, Andrew. “P Values and Statistical Practice.” Epidemiology 24 (2013): 69-72.
2. Gelman, Andrew. “55,000 residents desperately need your help!” Chance 17 (2004): 28-31.
3. Greenland, Sander, and Charles Poole. “Living with P-values: resurrecting a Bayesian perspective on frequentist statistics.” Epidemiology 24 (2013): 62-68.
4. Bem, Daryl. “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect.” Journal of Personality and Social Psychology (2010).
5. Wagenmakers E, Wetzels R, Borsboom D, van der Maas H. “Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011).” Journal of Personality and Social Psychology 100 (2011): 426-432.
6. “Testing.” Last updated March 5, 2014. http://www.stat.washington.edu/peter/341/Testing.pdf
7. Runkel, Patrick. “Large Samples: Too Much of a Good Thing?” The Minitab Blog, June 4, 2012. http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing