
Drawing conclusions from data


How data journalists can get stories from their data without fooling themselves. Includes statistical testing, correlation, causation, and more. A talk at the 15th Annual Science Immersion Workshop for Journalists at the Metcalf Institute for Marine & Environmental Reporting, Rhode Island.

Published in: Education, Technology

  1. Drawing Conclusions from Data
     An introduction to statistical testing without equations
     Metcalf Institute 15th Annual Science Immersion Workshop for Journalists
     Jonathan Stray, Columbia University
  2. You see a story in the data
     Is it really there?
  3. Why wouldn't there be a story?
     •  You misunderstand how the data is collected
     •  The data is incomplete or bad
     •  The pattern is due to chance
     •  The pattern is real, but it isn't a causal relationship
     •  The data doesn't generalize the way you want it to
  4. How was this data created?
  5. Intentional or unintentional problems
  6. What doesn't a Twitter map show?
  7. NYC population colored by income
  8. "Interview the data"
     Where do these numbers come from? Who recorded them? How? For what purpose was this data collected? How do we know it is complete? What are the demographics? Is this the right way to quantify this issue? Who is not included in these figures? Who is going to look bad or lose money as a result of these numbers? What arbitrary choices had to be made to generate the data? Is the data consistent with other sources? Who has already analyzed it? Does it have known flaws? Are there multiple versions?
  9. What stats know-how gets you
     •  You misunderstand how the data is collected
     •  The data is incomplete or bad
     •  The pattern is due to chance
     •  The pattern is real, but it isn't a causal relationship
     •  The data doesn't generalize the way you want it to
  10. Statistical testing
      Assumes the data is good, but includes an element of chance.
      Then the first question is: is the pattern a coincidence... or not?
      Or: Is "coincidence" consistent with the data?
  11. Is this die loaded?
  12. First rule of statistics
      Smaller samples have more variance.
      That's why more data is always better, from the point of view of statistical testing.
      More data increases "statistical power."
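The first rule can be checked by simulation. The sketch below is not from the talk: it assumes a fair die, repeats the same experiment many times at three sample sizes chosen for illustration, and measures how much the share of sixes bounces around from one experiment to the next. The function names and the numbers of rolls are hypothetical.

```python
import random
import statistics

def share_of_sixes(n_rolls, rng):
    """Roll a fair die n_rolls times; return the fraction of rolls that came up six."""
    return sum(rng.randint(1, 6) == 6 for _ in range(n_rolls)) / n_rolls

def spread_of_share(n_rolls, n_experiments=500, seed=0):
    """Standard deviation of the share of sixes across many repeated experiments."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    shares = [share_of_sixes(n_rolls, rng) for _ in range(n_experiments)]
    return statistics.pstdev(shares)

# Illustrative sample sizes: the spread should shrink as the sample grows.
for n in (10, 100, 1000):
    print(n, "rolls per experiment, spread of share of sixes:", round(spread_of_share(n), 3))
```

With only 10 rolls, a fair die can easily look loaded; with 1,000 rolls the share of sixes stays much closer to one in six, which is the sense in which more data gives more statistical power.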
  13. Are these two dice loaded?
  14. Two dice: non-uniform distribution
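The non-uniform distribution for two fair dice can be worked out by listing all 36 equally likely outcomes. A minimal sketch, assuming two ordinary six-sided dice (the variable names are illustrative):

```python
from collections import Counter
from itertools import product

# Count how many of the 36 equally likely (first die, second die) pairs give each total.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

for total in range(2, 13):
    print(f"{total:2d}  {sums[total]}/36  {'#' * sums[total]}")
```

Totals near 7 come up far more often than 2 or 12 even when nothing is loaded, so an uneven pattern of totals is not by itself evidence of a problem with the dice.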
  15. The Null Hypothesis, H0
      The pattern I see is just due to chance.
      Distribution of data under H0 = what might the data look like if generated purely by chance?
  16. Comparing two sets of numbers
      Let's say you measure the grades of students in two different classes and the averages are different.
      Is this evidence that something is different between the two classes?
  17. Constructing the null distribution
      We don't have a theoretical argument (like the dice).
      But if the two classes are really the same, then we can switch students between them if we want, and the null hypothesis will still hold.
  18. Constructing the null distribution
      Observed data
      Class A = 0.90  0.93  1.25  1.24  1.38  0.94  1.14  0.73  1.46
      Class B = 1.15  0.88  0.90  0.74  1.21
      Permuted data
      Class A = 1.25  0.90  0.90  0.93  0.74  0.73  0.94  1.15  0.88
      Class B = 1.21  1.14  1.46  1.38  1.24
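The shuffling step can be repeated many times by computer. The sketch below uses the observed Class A and Class B values from the slide; the choice of the difference in averages as the statistic, the 10,000 shuffles, and the function names are assumptions made for illustration, not details given in the talk.

```python
import random

# Observed data, copied from the slide.
class_a = [0.90, 0.93, 1.25, 1.24, 1.38, 0.94, 1.14, 0.73, 1.46]
class_b = [1.15, 0.88, 0.90, 0.74, 1.21]

def mean_difference(a, b):
    """Average of a minus average of b."""
    return sum(a) / len(a) - sum(b) / len(b)

def permutation_null(a, b, n_permutations=10_000, seed=0):
    """Shuffle students between the two classes and record the difference each time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pooled = list(a) + list(b)
    diffs = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diffs.append(mean_difference(pooled[:len(a)], pooled[len(a):]))
    return diffs

observed = mean_difference(class_a, class_b)
null_diffs = permutation_null(class_a, class_b)

# Share of shuffles with a gap at least as large as the observed one, in either direction.
share_as_extreme = sum(abs(d) >= abs(observed) for d in null_diffs) / len(null_diffs)
print("observed difference:", round(observed, 3))
print("share of shuffles at least this extreme:", round(share_as_extreme, 3))
```

The list `null_diffs` is the null distribution: a histogram of it, with the observed difference marked, shows how unusual the real split is compared with splits produced by chance alone.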
  19. observed difference
  20. How sure do we need to be?
      These plots of the null distribution show us how often H0 will look like a pattern. This is the "p-value."
      If the p-value is lower, we have stronger evidence that what we see is "real."
      How low is low enough?
  21. Two values
      "Significance" is how sure you are that the effect you're seeing is real. "Effect size" is how big the effect is.
      Example: class grades differed by 3%, p<0.05
      Warning: large significance doesn't mean a large effect!
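The warning can be illustrated with made-up numbers. In the sketch below, two hypothetical groups of 2,000 simulated scores differ by about one point on a 100-point scale; the group sizes, means, spread, and number of shuffles are all assumptions chosen for the example, and the p-value is estimated by shuffling group labels.

```python
import random

def mean(values):
    return sum(values) / len(values)

def permutation_p_value(a, b, n_permutations=300, seed=1):
    """Share of shuffled splits whose gap in averages is at least as large as the observed gap."""
    rng = random.Random(seed)
    observed_gap = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed_gap:
            hits += 1
    return hits / n_permutations

# Simulated (not real) scores: a true gap of 1 point, with a spread of 5 points.
rng = random.Random(0)
group_a = [rng.gauss(71, 5) for _ in range(2000)]
group_b = [rng.gauss(70, 5) for _ in range(2000)]

effect_size = mean(group_a) - mean(group_b)
p_value = permutation_p_value(group_a, group_b)
print("effect size (difference in averages):", round(effect_size, 2))
print("estimated p-value:", p_value)
```

Because the groups are large, even a one-point gap is very unlikely to arise from shuffling alone, so the result is highly significant while the effect stays small. Both numbers need reporting.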
  22. I see a trend. Is it "real"?
  23. Looking for correlations
      Suppose you want to know if more firearms correlate with more firearm homicides.
      First, a scatterplot.
  24. Constructing the null distribution
      Again, we don't have a theoretical distribution that tells us what the distributions of firearms and homicides should be, if they're independent.
      But...
  25. Constructing the null distribution
      But... if X and Y variables are truly independent, then switching which X goes with which Y won't make any difference.
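Switching which X goes with which Y is easy to do in code. The sketch below does not use the firearm data from the talk, which is not reproduced in the slides; it generates 30 made-up (x, y) pairs with a built-in relationship, then shuffles the y values many times to see what correlations arise when the pairing is random. The Pearson correlation coefficient is used as the statistic, which is an assumption.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(0)

# Hypothetical data: y rises with x, plus noise. These are not real measurements.
xs = [rng.uniform(0, 10) for _ in range(30)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

observed_r = pearson_r(xs, ys)

# Null distribution: break the pairing by shuffling y, and recompute the correlation.
null_rs = []
shuffled = list(ys)
for _ in range(5000):
    rng.shuffle(shuffled)
    null_rs.append(pearson_r(xs, shuffled))

p_value = sum(abs(r) >= abs(observed_r) for r in null_rs) / len(null_rs)
print("observed correlation:", round(observed_r, 2))
print("share of shuffles with a correlation at least this strong:", p_value)
```

The shuffled correlations cluster around zero, so an observed correlation far out in the tail is hard to explain as a coincidence of pairing.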
  26. A correlation puzzle
      Suppose you discover that the students with the top 5% of standardized test scores come from smaller classes.
      Why?
  27. Have we learned nothing?
      Smaller samples will always have higher variance.
      So the smaller classes will have higher scores. They will also have lower scores.
      Protect yourself from reasoning errors: always plot null distributions.
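The puzzle can be reproduced with simulated classes in which class size has no effect at all. In the sketch below every student's score is drawn from the same distribution; the class sizes (10 and 40), the number of classes, and the score distribution are hypothetical choices, and the comparison is made on class averages.

```python
import random

def class_average(size, rng):
    """Average score of a class whose students all come from the same distribution."""
    return sum(rng.gauss(70, 10) for _ in range(size)) / size

rng = random.Random(0)

# 200 small classes and 200 large classes; class size has no effect on any student.
classes = [("small", class_average(10, rng)) for _ in range(200)]
classes += [("large", class_average(40, rng)) for _ in range(200)]

classes.sort(key=lambda c: c[1])
k = len(classes) // 20  # 5% of all classes
bottom, top = classes[:k], classes[-k:]

small_in_top = sum(label == "small" for label, _ in top)
small_in_bottom = sum(label == "small" for label, _ in bottom)
print(f"top 5%: {small_in_top} of {k} classes are small")
print(f"bottom 5%: {small_in_bottom} of {k} classes are small")
```

Small classes crowd both ends of the ranking purely because their averages vary more, which is why looking at only the top scores is misleading.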
  28. Have I really found the cause?
  29. Suppose you apply a statistical test, and the smaller classes really are unlikely to have scores this high by chance.
      Was it really because of the smaller class size?
  30. You will invent stories about your data
  31. How correlation happens
      •  X causes Y
      •  Y causes X
      •  random chance!
      •  hidden variable causes X and Y: Z causes X and Y
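The hidden-variable case can be shown with simulated data. In the sketch below, a made-up variable Z drives both X and Y, and X and Y have no influence on each other; the sample size, the noise levels, and the narrow band used to hold Z roughly fixed are all assumptions for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

rng = random.Random(0)
rows = []
for _ in range(20_000):
    z = rng.gauss(0, 1)        # hidden variable
    x = z + rng.gauss(0, 1)    # X depends only on Z plus noise
    y = z + rng.gauss(0, 1)    # Y depends only on Z plus noise
    rows.append((x, y, z))

r_all = pearson_r([x for x, _, _ in rows], [y for _, y, _ in rows])

# Hold Z roughly constant by keeping only rows where Z is close to zero.
band = [(x, y) for x, y, z in rows if abs(z) < 0.1]
r_within = pearson_r([x for x, _ in band], [y for _, y in band])

print("correlation of X and Y overall:", round(r_all, 2))
print("correlation of X and Y with Z held roughly fixed:", round(r_within, 2))
```

X and Y are clearly correlated overall, yet the correlation nearly disappears once Z is held fixed. The correlation is real, but neither variable causes the other.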
  32. Guns and firearm homicides?
      •  if you have a gun, you're going to use it
      •  if it's a dangerous neighborhood, you'll buy a gun
      •  the correlation is due to chance
  33. Beauty and responses
      •  telling a woman she's beautiful doesn't work
      •  if a woman is beautiful, 1) she'll respond less 2) people will tell her that
      Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.
  34. Generalizability
      Suppose you apply a statistical test, and the smaller classes really are unlikely to have scores this high by chance.
      Will the same thing be true in other states?
  35. Generalizability
      Are those three students you interviewed really representative of all students?
      Everyone you know is talking about it, but is everyone else?
      What's the margin of error of this poll?
      The statistics of generalizability: another time...
  36. In Short
      •  First ask about what the data means, where it came from, and if it's good.
      •  Then ask about coincidence. Get a look at the null distribution.
      •  If the correlation is significant, then ask about causality. Rule out each case.
      •  Are your results standing in for things you don't actually have data on?
