
Frontiers of Computational Journalism week 4 - Statistical Inference


Taught at Columbia Journalism School, Fall 2018
Full syllabus and lecture videos at http://www.compjournalism.com/?p=218

Published in: Education

  1. 1. Frontiers of Computational Journalism Columbia Journalism School Week 4: Quantification and Statistical Inference October 3, 2018
  2. 2. This class • Quantification • Data Quality • Risk ratios • Regression • Causation • Interpretation
  3. 3. Quantification
  4. 4. Quantification $\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$
  5. 5. Different types of counting • Numeric ◦ Continuous or discrete ◦ Units of measurement? ◦ Non-linear scales? • Categorical ◦ finite, e.g. {true, false} ◦ infinite, e.g. {red, yellow, blue, ... chartreuse, ...} ◦ ordered?
  6. 6. Choices about what to count
  7. 7. GDP = C + I + G + (X - M)
  8. 8. 1940 U.S. census enumerator instructions
  9. 9. 2010 U.S. census race and ethnicity questions
  10. 10. Some things that are tricky to quantify, but usefully quantified anyway • Intelligence • Academic performance • Race, ethnicity, nationality, gender • Number of incidents of some type • Income • Political Ideology
  11. 11. Data Quality
  12. 12. Intentional or unintentional problems
  13. 13. It looks like Lucknow and Kanpur have few traffic accidents, but deaths data suggests that accidents are not being counted. Lies and Statistics: How India’s Most-Populous State Fudges Crime Data, IndiaSpend
  14. 14. Evaluating Data Quality Internal validity: check the data against itself • row counts (e.g. all 50 states?) • related data • histograms • do the numbers add up? External validity: compare the data to something else. • alternate data sources • expert knowledge • previous versions • common sense!
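Two of the internal validity checks above (row counts, and whether the numbers add up) can be sketched in a few lines. This is a minimal illustration, not a method from the lecture: the function names, the column names, and the three-state dataset are all hypothetical.

```python
from collections import Counter


def check_row_coverage(rows, key, expected):
    """Compare the keys present in the data against the keys we expect.

    Returns (missing, duplicated, unexpected) as sorted lists.
    """
    seen = Counter(row[key] for row in rows)
    missing = sorted(set(expected) - set(seen))
    duplicated = sorted(k for k, n in seen.items() if n > 1)
    unexpected = sorted(set(seen) - set(expected))
    return missing, duplicated, unexpected


def totals_add_up(parts, reported_total, tolerance=0):
    """True if the parts sum to the reported total, within a tolerance."""
    return abs(sum(parts) - reported_total) <= tolerance


# Hypothetical data: made-up counts for three states, with deliberate problems.
expected_states = ["CT", "NJ", "NY"]
rows = [
    {"state": "NY", "urban": 120, "rural": 30, "total": 150},
    {"state": "NJ", "urban": 80, "rural": 20, "total": 110},  # does not add up
    {"state": "NY", "urban": 120, "rural": 30, "total": 150},  # duplicate row
]

missing, duplicated, unexpected = check_row_coverage(rows, "state", expected_states)
bad_totals = [
    row["state"]
    for row in rows
    if not totals_add_up([row["urban"], row["rural"]], row["total"])
]

print("missing:", missing)
print("duplicated:", duplicated)
print("unexpected:", unexpected)
print("totals that do not add up:", bad_totals)
```

Checks like these only flag rows to ask about; deciding whether a flagged row is an error still takes the external validity steps (other sources, expert knowledge, common sense).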
  15. 15. Interview the Data • Who created this data? • What is this data supposed to count? • How was this data actually collected? • Does it really count what it’s supposed to? • For what purpose was this data collected? • How do we know it is complete? • If the data was collected from people, who was asked and how?
  16. 16. Interview the Data • Who is going to look bad or lose money because of this data? • Is the data consistent with other sources? • Is the data consistent from day to day, or when collected by different people? • Who has already analyzed it? • Are there multiple versions? • Does this data have known problems?
  17. 17. Risk ratios
  18. 18. Deadly Force in Black and White, ProPublica 10/10/2014
  19. 19. AP Clinton Foundation Story WASHINGTON (AP) — More than half the people outside the government who met with Hillary Clinton while she was secretary of state gave money — either personally or through companies or groups — to the Clinton Foundation. It’s an extraordinary proportion indicating her possible ethics challenges if elected president. At least 85 of 154 people from private interests who met or had phone conversations scheduled with Clinton while she led the State Department donated to her family charity or pledged commitments to its international programs, according to a review of State Department calendars released so far to The Associated Press. Combined, the 85 donors contributed as much as $156 million. At least 40 donated more than $100,000 each, and 20 gave more than $1 million. - Many donors to Clinton Foundation met with her at State, AP, 8/24/2016
  20. 20. [2×2 table: rows Blue, Yellow; columns Accident, No Accident]
  21. 21. Relative risk (risk ratio)
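As a rough sketch of the calculation, relative risk divides the rate of the outcome in one group by the rate in the other. The counts below are hypothetical (the transcript preserves only the Blue/Yellow and Accident/No Accident labels, not the numbers), and the function name is an illustration rather than anything from the lecture.

```python
def relative_risk(a, b, c, d):
    """Risk ratio for a 2x2 table.

    a: group 1 with the outcome     b: group 1 without the outcome
    c: group 2 with the outcome     d: group 2 without the outcome
    """
    risk_group1 = a / (a + b)  # share of group 1 that had the outcome
    risk_group2 = c / (c + d)  # share of group 2 that had the outcome
    return risk_group1 / risk_group2


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
blue_accident, blue_no_accident = 30, 970
yellow_accident, yellow_no_accident = 10, 990

rr = relative_risk(
    blue_accident, blue_no_accident, yellow_accident, yellow_no_accident
)
print(f"relative risk, blue vs. yellow: {rr:.2f}")
```

With these made-up counts the blue group's accident rate is 3% and the yellow group's is 1%, so the ratio says blue is about three times as risky. All four cells of the table are needed to compute it.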
  22. 22. AP Clinton Foundation Story “At least 85 of 154 people from private interests who met or had phone conversations scheduled with Clinton while she led the State Department donated to her family charity or pledged commitments to its international programs, according to a review of State Department calendars.” odds
  23. 23. AP Clinton Foundation Story • Odds: not enough information to compute the odds ratio... which you can tell immediately because four values are required.
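The two numbers in the AP story (85 donors among 154 people who met or had calls scheduled) are enough for a proportion and for the odds within that one group, but not for an odds ratio. A small sketch of the arithmetic follows; the `odds_ratio` helper is a generic illustration, and the story supplies no comparison group to feed it.

```python
# The only two numbers the AP story provides.
donors = 85
met = 154
non_donors = met - donors

proportion = donors / met   # share of those who met who were donors
odds = donors / non_donors  # donors per non-donor, within this group only

print(f"proportion: {proportion:.3f}")
print(f"odds: {odds:.3f}")


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table: (a / b) / (c / d).

    a, b: with / without the outcome in group 1
    c, d: with / without the outcome in group 2
    The story gives only a and b, so this cannot be evaluated for it.
    """
    return (a / b) / (c / d)
```

"More than half" describes the proportion. Judging whether that proportion is unusual would need the corresponding counts for a comparison group, which are the two missing values.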
  24. 24. Regression
  25. 25. Speed Trap: Who gets a ticket, who gets a break? Boston Globe, 2004
  26. 26. Speed Trap: Who gets a ticket, who gets a break? Boston Globe, 2004
  27. 27. Speed Trap: Who gets a ticket, who gets a break? Boston Globe, 2004
  28. 28. Nike Says Its $250 Running Shoes Will Make You Run Much Faster, New York Times
  29. 29. Surgeon Scorecard, ProPublica 2015
  30. 30. ACR = adjusted complication rate (reported in story) Surgeon Scorecard methodology paper, ProPublica 2015
  31. 31. Causal Models
  32. 32. Does chocolate make you smarter?
  33. 33. Occupational Group | Smoking | Mortality
          Farmers, foresters, and fishermen | 77 | 84
          Miners and quarrymen | 137 | 116
          Gas, coke and chemical makers | 117 | 123
          Glass and ceramics makers | 94 | 128
          Furnace, forge, foundry, and rolling mill | 116 | 155
          Electrical and electronics workers | 102 | 101
          Engineering and allied trades | 111 | 118
          Woodworkers | 93 | 113
          Leather workers | 88 | 104
          Textile workers | 102 | 88
          Clothing workers | 91 | 104
          Food, drink, and tobacco workers | 104 | 129
          Paper and printing workers | 107 | 86
          Makers of other products | 112 | 96
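One way to summarize the table above is a Pearson correlation between the smoking and mortality columns. The sketch below uses only the 14 rows shown on this slide and a hand-written `pearson` helper (an illustrative name, not from the lecture); it measures association only and says nothing about cause.

```python
from math import sqrt

# (occupational group, smoking index, mortality index), copied from the slide.
rows = [
    ("Farmers, foresters, and fishermen", 77, 84),
    ("Miners and quarrymen", 137, 116),
    ("Gas, coke and chemical makers", 117, 123),
    ("Glass and ceramics makers", 94, 128),
    ("Furnace, forge, foundry, and rolling mill", 116, 155),
    ("Electrical and electronics workers", 102, 101),
    ("Engineering and allied trades", 111, 118),
    ("Woodworkers", 93, 113),
    ("Leather workers", 88, 104),
    ("Textile workers", 102, 88),
    ("Clothing workers", 91, 104),
    ("Food, drink, and tobacco workers", 104, 129),
    ("Paper and printing workers", 107, 86),
    ("Makers of other products", 112, 96),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


smoking = [s for _, s, _ in rows]
mortality = [m for _, _, m in rows]
r = pearson(smoking, mortality)
print(f"correlation between smoking and mortality (14 groups): {r:.2f}")
```

Each row is an occupational group, not a person, so the coefficient describes how the groups line up and should not be read as a statement about individuals.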
  34. 34. Does marriage make women safer?
  35. 35. How correlation happens • X causes Y (X → Y) • Y causes X (Y → X) • random chance! • hidden variable causes X and Y • Z causes X and Y (X ← Z → Y)
  36. 36. Guns and firearm homicides? • if you have a gun, you're going to use it • if it's a dangerous neighborhood, you'll buy a gun • the correlation is due to chance
  37. 37. Beauty and responses • telling a woman she's beautiful makes her respond less • if a woman is beautiful, 1) she'll respond less 2) people will tell her that • Beauty is a "confounding variable." The correlation is real, but you've misunderstood the causal structure.
  38. 38. A causal network. From Statistical Modeling: A Fresh Approach
  39. 39. What an experiment is: intervene in a network of causes
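A toy simulation can make the idea of intervening concrete. In the sketch below, a hidden variable Z drives both X and Y, and X has no effect on Y at all. In the observational run X inherits Z's influence, so X and Y correlate; in the experimental run X is assigned at random, which cuts the link from Z to X. The model, the noise levels, and the function names are all assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate(n, randomize_x, seed):
    """Return corr(X, Y) in a toy world where Z causes Y and X never does.

    randomize_x=False: observational data, Z also causes X.
    randomize_x=True:  experiment, X is set at random regardless of Z.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden variable
        if randomize_x:
            x = rng.gauss(0, 1)  # intervention: X no longer depends on Z
        else:
            x = z + rng.gauss(0, 1)  # observation: Z pushes X around
        y = z + rng.gauss(0, 1)  # Y depends on Z only, never on X
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


observational_r = simulate(20000, randomize_x=False, seed=0)
experimental_r = simulate(20000, randomize_x=True, seed=0)
print(f"observational correlation: {observational_r:.2f}")
print(f"experimental correlation:  {experimental_r:.2f}")
```

The observational correlation is substantial even though X does nothing to Y, while the randomized version comes out close to zero, which is the sense in which an experiment separates a causal effect from a shared cause.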
  40. 40. Does Facebook news feed cause people to share links?
  41. 41. Interpretation generally
  42. 42. Same data, different meaning
  43. 43. More than one true story
  44. 44. More than one true story
  45. 45. Crime in context, The Marshall Project
