Frontiers of
Computational Journalism
Columbia Journalism School
Week 4: Quantification and Statistical Inference
October 3, 2018
This class
• Quantification
• Data Quality
• Risk ratios
• Regression
• Causation
• Interpretation
Quantification
$\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_N \end{bmatrix}$
Different types of counting
• Numeric
o Continuous or discrete
o Units of measurement?
o Non-linear scales?
• Categorical
o finite, e.g. {true, false}
o infinite e.g. {red, yellow, blue, ... chartreuse…}
o ordered?
Choices about what to count
GDP = C + I + G + (X - M)
1940 U.S. census enumerator
instructions
2010 U.S. census race and
ethnicity questions
Some things that are tricky to quantify,
but usefully quantified anyway
• Intelligence
• Academic performance
• Race, ethnicity, nationality, gender
• Number of incidents of some type
• Income
• Political Ideology
Data Quality
Intentional or unintentional problems
It looks like Lucknow and Kanpur have few traffic accidents, but
deaths data suggests that accidents are not being counted.
Lies and Statistics: How India’s Most-Populous State Fudges Crime Data, IndiaSpend
Evaluating Data Quality
Internal validity: check the data against itself
• row counts (e.g. all 50 states?)
• related data
• histograms
• do the numbers add up?
External validity: compare the data to something else.
• alternate data sources
• expert knowledge
• previous versions
• common sense!
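A few of the internal checks above can be automated. The following is a minimal sketch, assuming a made-up table of per-state counts; the field names (`state`, `total`, `urban`, `rural`) and the three-state expected set are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical records: each row reports a total and two parts that should sum to it.
rows = [
    {"state": "NY", "total": 120, "urban": 90, "rural": 30},
    {"state": "CA", "total": 200, "urban": 150, "rural": 40},  # parts do not add up
    {"state": "NY", "total": 120, "urban": 90, "rural": 30},   # state appears twice
]

# Hypothetical list of states we expect to see (a real check would list all 50).
EXPECTED_STATES = {"NY", "CA", "TX"}


def internal_checks(rows, expected_states):
    """Check the data against itself: row coverage, duplicates, and sums."""
    seen = Counter(r["state"] for r in rows)
    return {
        # Row counts: which expected states never appear?
        "missing": sorted(expected_states - set(seen)),
        # Row counts: which states appear more than once?
        "duplicated": sorted(s for s, n in seen.items() if n > 1),
        # Do the numbers add up?
        "bad_totals": sorted(
            {r["state"] for r in rows if r["urban"] + r["rural"] != r["total"]}
        ),
    }


report = internal_checks(rows, EXPECTED_STATES)
print(report)
```

Checks like these only flag rows worth a closer look; external validity still needs another source or an expert.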
Interview the Data
• Who created this data?
• What is this data supposed to count?
• How was this data actually collected?
• Does it really count what it’s supposed to?
• For what purpose was this data collected?
• How do we know it is complete?
• If the data was collected from people, who was asked
and how?
• Who is going to look bad or lose money because of this
data?
• Is the data consistent with other sources?
• Is the data consistent from day to day, or when collected
by different people?
• Who has already analyzed it?
• Are there multiple versions?
• Does this data have known problems?
Risk ratios
Deadly Force in Black and White, ProPublica 10/10/2014
AP Clinton Foundation Story
WASHINGTON (AP) — More than half the people outside the government who
met with Hillary Clinton while she was secretary of state gave money — either
personally or through companies or groups — to the Clinton Foundation. It’s an
extraordinary proportion indicating her possible ethics challenges if elected
president.
At least 85 of 154 people from private interests who met or had phone
conversations scheduled with Clinton while she led the State Department
donated to her family charity or pledged commitments to its international
programs, according to a review of State Department calendars released so far
to The Associated Press. Combined, the 85 donors contributed as much as
$156 million. At least 40 donated more than $100,000 each, and 20 gave more
than $1 million.
- Many donors to Clinton Foundation met with her at State, AP, 8/24/2016
Relative risk (risk ratio)
2x2 table: rows Accident / No Accident; columns Blue / Yellow
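As a rough illustration of relative risk, the sketch below compares accident rates for two groups of cars. The counts are invented for the example and do not come from any dataset in this lecture; the function name `relative_risk` is likewise just a label chosen here.

```python
def relative_risk(group_a_events, group_a_total, group_b_events, group_b_total):
    """Risk in group A divided by risk in group B."""
    risk_a = group_a_events / group_a_total
    risk_b = group_b_events / group_b_total
    return risk_a / risk_b


# Hypothetical counts: 20 accidents among 1000 blue cars,
# 10 accidents among 1000 yellow cars.
rr = relative_risk(20, 1000, 10, 1000)
print(f"Blue cars have {rr:.1f} times the accident risk of yellow cars")
```

With these invented counts the blue-car risk is twice the yellow-car risk. Note that all four cells of the table are needed: the events and the totals for both groups.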
AP Clinton Foundation Story
“At least 85 of 154 people from private interests who met or had phone
conversations scheduled with Clinton while she led the State Department
donated to her family charity or pledged commitments to its international
programs, according to a review of State Department calendars.”
Odds
Not enough information to compute the odds ratio, which you can tell
immediately because four values are required and the story reports only
two: donors and non-donors among the people who met with Clinton.
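To make the distinction concrete, here is a small sketch of odds versus odds ratio. The first calculation uses only the two figures quoted from the AP story (85 donors among 154 people). The odds ratio function is shown with clearly invented counts, since the story does not supply the other two cells.

```python
def odds(events, total):
    """Odds = events : non-events."""
    return events / (total - events)


def odds_ratio(a, b, c, d):
    """Odds ratio from a full 2x2 table.

    a: group 1 with outcome      b: group 1 without outcome
    c: group 2 with outcome      d: group 2 without outcome
    """
    return (a / b) / (c / d)


# From the quoted story: 85 of the 154 people donated, so 69 did not.
donor_odds = odds(85, 154)
print(f"Odds of being a donor among those who met: {donor_odds:.2f}")

# The odds ratio would also need donors and non-donors among people who
# did NOT meet with Clinton. The numbers below are purely hypothetical.
example_or = odds_ratio(20, 980, 10, 990)
print(f"Odds ratio for the hypothetical table: {example_or:.2f}")
```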
Regression
Speed Trap: Who gets a ticket, who gets a break? Boston Globe, 2004
Nike Says Its $250 Running Shoes Will Make You Run Much Faster,
New York Times
Surgeon Scorecard, ProPublica 2015
ACR = adjusted complication rate (reported in story)
Surgeon Scorecard methodology paper, ProPublica 2015
Causal Models
Does chocolate make you smarter?
Occupational Group                          Smoking  Mortality
Farmers, foresters, and fishermen                77         84
Miners and quarrymen                            137        116
Gas, coke and chemical makers                   117        123
Glass and ceramics makers                        94        128
Furnace, forge, foundry, and rolling mill       116        155
Electrical and electronics workers              102        101
Engineering and allied trades                   111        118
Woodworkers                                      93        113
Leather workers                                  88        104
Textile workers                                 102         88
Clothing workers                                 91        104
Food, drink, and tobacco workers                104        129
Paper and printing workers                      107         86
Makers of other products                        112         96
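One way to summarize the relationship in the occupational table is the Pearson correlation coefficient. The sketch below computes it in plain Python for the 14 rows shown on the slide; the helper name `pearson` is chosen here for illustration. A correlation by itself says nothing about which way causation runs.

```python
from math import sqrt

# Smoking and mortality columns, in the same row order as the table.
smoking = [77, 137, 117, 94, 116, 102, 111, 93, 88, 102, 91, 104, 107, 112]
mortality = [84, 116, 123, 128, 155, 101, 118, 113, 104, 88, 104, 129, 86, 96]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson(smoking, mortality)
print(f"Correlation between smoking and mortality: {r:.2f}")
```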
Does marriage make women safer?
How correlation happens
X -> Y: X causes Y
Y -> X: Y causes X
X, Y (no link): random chance!
X <- ? -> Y: hidden variable causes X and Y
X <- Z -> Y: Z causes X and Y
Guns and firearm homicides?
X -> Y: if you have a gun, you're going to use it
Y -> X: if it's a dangerous neighborhood, you'll buy a gun
X, Y (no link): the correlation is due to chance
Beauty and responses
X -> Y: telling a woman she's beautiful makes her respond less
X <- Z -> Y: if a woman is beautiful,
1) she'll respond less
2) people will tell her that
Beauty is a "confounding variable." The correlation is real,
but you've misunderstood the causal structure.
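A small simulation can show how a confounder produces a real correlation with no direct causal link. This is a generic sketch with synthetic data, not a model of the example above: Z is drawn at random, and X and Y are each built from Z plus independent noise, so neither X nor Y influences the other. All names and parameters are assumptions made for the illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Z causes both X and Y; X and Y never touch each other.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [zi + random.gauss(0, 1) for zi in z]

# X and Y are clearly correlated, even though neither causes the other.
r_xy = pearson(x, y)

# Subtracting Z's contribution (known here only because we built the data)
# leaves independent noise, and the correlation disappears.
x_resid = [xi - zi for xi, zi in zip(x, z)]
y_resid = [yi - zi for yi, zi in zip(y, z)]
r_resid = pearson(x_resid, y_resid)

print(f"corr(X, Y) = {r_xy:.2f}, after removing Z = {r_resid:.2f}")
```

In real data the confounder's effect is not known in advance, which is why controlling for it requires a model (such as regression) or an experiment.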
A causal network. From Statistical Modeling: A Fresh Approach
What an experiment is:
intervene in a network of causes
Does Facebook news feed cause
people to share links?
Interpretation generally
Same data, different meaning
More than one true story
Crime in context, The Marshall Project
