Drawing conclusions from data
 


How data journalists can get stories from their data without fooling themselves. Includes statistical testing, correlation, causation, and more. A talk at the 15th Annual Science Immersion Workshop for Journalists at the Metcalf Institute for Marine & Environmental Reporting, Rhode Island.

License: CC Attribution-NonCommercial


    Presentation Transcript

    • Drawing Conclusions from Data: An introduction to statistical testing without equations. Metcalf Institute 15th Annual Science Immersion Workshop for Journalists. Jonathan Stray, Columbia University
    • You see a story in the data. Is it really there?
    • Why wouldn't there be a story?
      • You misunderstand how the data is collected
      • The data is incomplete or bad
      • The pattern is due to chance
      • The pattern is real, but it isn't a causal relationship
      • The data doesn't generalize the way you want it to
    • How was this data created?
    • Intentional or unintentional problems
    • What doesn't a Twitter map show?
    • NYC population colored by income
    • "Interview the data": Where do these numbers come from? Who recorded them? How? For what purpose was this data collected? How do we know it is complete? What are the demographics? Is this the right way to quantify this issue? Who is not included in these figures? Who is going to look bad or lose money as a result of these numbers? What arbitrary choices had to be made to generate the data? Is the data consistent with other sources? Who has already analyzed it? Does it have known flaws? Are there multiple versions?
    • What stats know-how gets you:
      • You misunderstand how the data is collected
      • The data is incomplete or bad
      • The pattern is due to chance
      • The pattern is real, but it isn't a causal relationship
      • The data doesn't generalize the way you want it to
    • Statistical testing: Assumes the data is good, but includes an element of chance. Then the first question is: is the pattern a coincidence... or not? Or: is "coincidence" consistent with the data?
    • Is this die loaded?
    • First rule of statistics: Smaller samples have more variance. That's why more data is always better, from the point of view of statistical testing. More data increases "statistical power."
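      A quick way to see this rule in action is to simulate it. The sketch below (Python with NumPy, added here as an illustration and not part of the original slides) rolls a fair die in repeated experiments of different sizes; the sample mean wanders much more when the sample is small.

        # Illustrative sketch: how much the sample mean of a fair die varies
        # for small vs. large samples.
        import numpy as np

        rng = np.random.default_rng(0)

        for n in (10, 100, 10_000):
            # 1,000 repeated experiments, each rolling the die n times
            means = rng.integers(1, 7, size=(1_000, n)).mean(axis=1)
            print(f"n={n:>6}: means spread from {means.min():.2f} "
                  f"to {means.max():.2f} (std dev {means.std():.3f})")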
    • Are these two dice loaded?
    • Two dice: non-uniform distribution
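      The non-uniform shape is easy to check by simulation. This sketch (again an illustration added here, not from the talk) tallies the sums of two fair dice: totals near 7 come up far more often than 2 or 12, so a lopsided histogram is exactly what chance alone produces.

        # Illustrative sketch: the null distribution for the sum of two fair dice.
        import numpy as np

        rng = np.random.default_rng(1)
        sums = rng.integers(1, 7, size=(100_000, 2)).sum(axis=1)

        totals, counts = np.unique(sums, return_counts=True)
        for total, count in zip(totals, counts):
            print(f"{total:>2}: {count / len(sums):.3f}")  # near 1/36 for 2 and 12, near 1/6 for 7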
    • The Null Hypothesis, H0: The pattern I see is just due to chance. Distribution of data under H0 = what might the data look like if it were generated purely by chance?
    • Comparing two sets of numbers: Let's say you measure the grades of students in two different classes and the averages are different. Is this evidence that something is different between the two classes?
    • Constructing the null distribution: We don't have a theoretical argument (like the dice). But if the two classes are really the same, then we can switch students between them if we want, and the null hypothesis will still hold.
    • Constructing the null distribution:
      Observed data
        Class A = 0.90 0.93 1.25 1.24 1.38 0.94 1.14 0.73 1.46
        Class B = 1.15 0.88 0.90 0.74 1.21
      Permuted data
        Class A = 1.25 0.90 0.90 0.93 0.74 0.73 0.94 1.15 0.88
        Class B = 1.21 1.14 1.46 1.38 1.24
    • [Plot: null distribution of the difference in class means, with the observed difference marked]
    • How sure do we need to be? These plots of the null distribution show us how often chance alone (H0) produces a pattern at least as strong as the one we see. This is the "p-value." The lower the p-value, the stronger the evidence that what we see is "real." How low is low enough?
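      Here is a minimal sketch of that permutation test in Python, using the class grades from the earlier slide. The 10,000 permutations and the two-sided comparison are choices made for this example, not something the talk specifies.

        # Permutation test: shuffle students between classes to build the
        # null distribution of the difference in mean grade.
        import numpy as np

        class_a = np.array([0.90, 0.93, 1.25, 1.24, 1.38, 0.94, 1.14, 0.73, 1.46])
        class_b = np.array([1.15, 0.88, 0.90, 0.74, 1.21])
        observed = class_a.mean() - class_b.mean()

        rng = np.random.default_rng(0)
        pooled = np.concatenate([class_a, class_b])
        null_diffs = np.empty(10_000)
        for i in range(10_000):
            shuffled = rng.permutation(pooled)  # under H0, class labels don't matter
            null_diffs[i] = shuffled[:len(class_a)].mean() - shuffled[len(class_a):].mean()

        # p-value: how often chance alone gives a difference at least this extreme
        p_value = np.mean(np.abs(null_diffs) >= abs(observed))
        print(f"observed difference {observed:.3f}, p-value {p_value:.2f}")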
    • Two values: "Significance" is how sure you are that the effect you're seeing is real. "Effect size" is how big the effect is. Example: class grades differed by 3%, p < 0.05. Warning: large significance doesn't mean a large effect!
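      One way to make the warning concrete (an illustration added here, not from the slides): with a large enough sample, even a trivially small difference becomes "statistically significant." This sketch uses simulated grades and an ordinary t-test.

        # With huge samples, a 0.3-point grade difference comes out highly
        # "significant" -- but it is still only a 0.3-point difference.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        a = rng.normal(loc=75.0, scale=10.0, size=200_000)  # mean grade 75
        b = rng.normal(loc=75.3, scale=10.0, size=200_000)  # mean grade 75.3

        t_stat, p_value = stats.ttest_ind(a, b)
        print(f"effect size: {b.mean() - a.mean():.2f} points, p-value: {p_value:.1e}")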
    • I see a trend. Is it "real"?
    • Looking for correlations: Suppose you want to know if more firearms correlate with more firearm homicides. First, a scatterplot.
    • Constructing the null distribution: Again, we don't have a theoretical distribution that tells us what the distributions of firearms and homicides should be if they're independent. But...
    • Constructing the null distribution: But... if the X and Y variables are truly independent, then switching which X goes with which Y won't make any difference.
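      A sketch of that shuffling idea follows. The firearm and homicide numbers below are placeholders invented for illustration; with real per-state figures, the logic is the same.

        # Permutation test for a correlation: shuffle Y against X to see what
        # correlations arise once the pairing is broken (i.e., under independence).
        import numpy as np

        guns      = np.array([5.2, 9.1, 12.4, 7.8, 15.0, 6.3, 11.1, 8.5])  # placeholder values
        homicides = np.array([1.1, 2.0,  2.9, 1.5,  3.4, 1.2,  2.6, 1.8])  # placeholder values

        observed_r = np.corrcoef(guns, homicides)[0, 1]

        rng = np.random.default_rng(0)
        null_rs = np.array([
            np.corrcoef(guns, rng.permutation(homicides))[0, 1]
            for _ in range(10_000)
        ])

        p_value = np.mean(np.abs(null_rs) >= abs(observed_r))
        print(f"observed r = {observed_r:.2f}, p-value under shuffling = {p_value:.3f}")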
    • A correlation puzzle: Suppose you discover that the students with the top 5% of standardized test scores come from smaller classes. Why?
    • Have we learned nothing? Smaller samples will always have higher variance. So the smaller classes will tend to have the highest scores. They will also tend to have the lowest scores. Protect yourself from reasoning errors: always plot null distributions.
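      This effect is easy to reproduce by simulation. In the sketch below (an illustration added here, with made-up class sizes, not data from the talk), every student is drawn from the same score distribution, so class size cannot actually matter; the smallest classes still dominate both the top and the bottom of the ranking.

        # Small classes land at both extremes purely because their averages are
        # noisier, even when every student has the same score distribution.
        import numpy as np

        rng = np.random.default_rng(0)
        class_sizes = rng.integers(10, 200, size=2_000)  # hypothetical class sizes

        class_means = np.array([rng.normal(70, 15, size=n).mean() for n in class_sizes])

        top    = class_sizes[class_means >= np.quantile(class_means, 0.95)]
        bottom = class_sizes[class_means <= np.quantile(class_means, 0.05)]

        print(f"average class size overall:        {class_sizes.mean():.0f}")
        print(f"average size of top-5% classes:    {top.mean():.0f}")     # noticeably smaller
        print(f"average size of bottom-5% classes: {bottom.mean():.0f}")  # also smaller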
    • Have I really found the cause?
    • Suppose you apply a statistical test, and the smaller classes really are unlikely to have scores this high by chance. Was it really because of the smaller class size?
    • You will invent stories about your data
    • How correlation happens:
      • X causes Y
      • Y causes X
      • random chance!
      • a hidden variable Z causes both X and Y
    • Guns and firearm homicides?
      • X causes Y: if you have a gun, you're going to use it
      • Y causes X: if it's a dangerous neighborhood, you'll buy a gun
      • chance: the correlation is due to chance
    • Beauty and responses:
      • it's not that X causes Y: telling a woman she's beautiful doesn't work
      • a hidden variable drives both: if a woman is beautiful, 1) she'll respond less, and 2) people will tell her that
      Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.
    • Generalizability: Suppose you apply a statistical test, and the smaller classes really are unlikely to have scores this high by chance. Will the same thing be true in other states?
    • Generalizability: Are those three students you interviewed really representative of all students? Everyone you know is talking about it, but is everyone else? What's the margin of error of this poll? The statistics of generalizability: another time...
    • In short:
      • First ask what the data means, where it came from, and whether it's good.
      • Then ask about coincidence. Get a look at the null distribution.
      • If the correlation is significant, then ask about causality. Rule out each case.
      • Are your results standing in for things you don't actually have data on?