Getting it the rightest
you can
Thomas Hargrove, Scripps News
John Perry, Atlanta Journal-Constitution
Janet Roberts, Reuters
Jennifer LaFleur, Reveal | The Center for Investigative Reporting
IRE 2015 CAR Conference, Atlanta
Beware duplicates
Every time Saint Paul, Minn., housing inspectors made
follow-up visits to check on violations, all of the data
entries from the previous visit were logged again.
So every violation was listed in the database multiple times.
Do integrity checks from your desk
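A minimal sketch of a duplicate check, assuming each violation can be written as a tuple of fields. The field layout and sample rows are invented for illustration and are not the Saint Paul data.

```python
from collections import Counter

# Hypothetical inspection rows: (property_id, violation_code, date_cited).
rows = [
    ("P-100", "V12", "2014-03-01"),
    ("P-100", "V12", "2014-03-01"),  # same violation logged again on a follow-up visit
    ("P-100", "V30", "2014-03-01"),
    ("P-200", "V12", "2014-05-17"),
]

def find_duplicates(records):
    """Return each record that appears more than once, with its count."""
    counts = Counter(records)
    return {rec: n for rec, n in counts.items() if n > 1}

dupes = find_duplicates(rows)

# dict.fromkeys keeps the first occurrence of each row and preserves order.
unique_rows = list(dict.fromkeys(rows))

print(len(rows), "rows,", len(unique_rows), "unique")
for rec, n in dupes.items():
    print(n, "copies of", rec)
```

Which fields define "the same violation" is a reporting decision, so it is worth confirming with the agency before dropping anything.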
Beware dates
Did 592,000
people in Ohio
really vote before
they registered?
Do integrity checks from your desk
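One way to run this kind of date check is sketched below. The field names and records are hypothetical, and the second step (tabulating the registration dates of the flagged records) is only a suggestion for spotting a pattern, not an explanation of the Ohio file.

```python
from collections import Counter
from datetime import date

# Hypothetical voter records, invented for illustration.
voters = [
    {"voter_id": 1, "registered": date(1996, 10, 1), "last_voted": date(2012, 11, 6)},
    {"voter_id": 2, "registered": date(2009, 1, 1), "last_voted": date(2008, 11, 4)},
    {"voter_id": 3, "registered": date(2009, 1, 1), "last_voted": date(2004, 11, 2)},
    {"voter_id": 4, "registered": date(2011, 6, 30), "last_voted": date(2010, 11, 2)},
]

def voted_before_registering(records):
    """Return records whose vote date is earlier than the registration date."""
    return [r for r in records if r["last_voted"] < r["registered"]]

suspects = voted_before_registering(voters)

# If many flagged records share one registration date, ask the data keepers
# what happened on that date before drawing any conclusion.
registration_dates = Counter(r["registered"] for r in suspects)

print(len(suspects), "of", len(voters), "records have a vote before registration")
print(registration_dates.most_common(1))
```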
Does it make sense?
“We select things for publication
just to make available a wide
scope of data to the public ...
There is some burden on the
public to decide whether or not to
use the material.”
--Kathleen McGuire,
Sourcebook of Criminal Justice Statistics
(a/k/a: The case of the disappearing lifers)
Do integrity checks from your desk
Do the data conform
to the real world?
Are half of the records male,
half female?
In a national data set, are
about 13 percent of the
records from California?
Are racial minorities
adequately represented?
Do integrity checks from your desk
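The comparison above can be sketched as a share-by-category check against an outside benchmark. The 13 percent figure comes from the slide; the sample records, the function names and the 5-point tolerance are assumptions chosen for illustration.

```python
from collections import Counter

def shares(values):
    """Return each category's share of all records."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def out_of_line(observed, expected, tolerance=0.05):
    """Return categories whose observed share differs from the expected
    share by more than the tolerance, as (observed, expected) pairs."""
    flagged = {}
    for category, exp in expected.items():
        obs = observed.get(category, 0.0)
        if abs(obs - exp) > tolerance:
            flagged[category] = (obs, exp)
    return flagged

# Hypothetical national file with only 1 California record out of 50.
states = ["CA"] * 1 + ["TX"] * 49
observed = shares(states)
flagged = out_of_line(observed, {"CA": 0.13})

print(flagged)
```

A flagged category is a question for the agency, not proof of an error. Some data sets legitimately depart from population shares.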
Check for patterns
in missing data.
Do patterns render
estimates inaccurate?
Do integrity checks from your desk
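A rough sketch of one way to look for patterns in missing data: compute the share of blank values within each group rather than for the file as a whole. The field names and records are hypothetical.

```python
from collections import defaultdict

def missing_rate_by_group(records, group_field, value_field):
    """Return the share of records in each group where value_field is blank."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for r in records:
        group = r[group_field]
        totals[group] += 1
        if r.get(value_field) in (None, ""):
            missing[group] += 1
    return {g: missing[g] / totals[g] for g in totals}

# Hypothetical records: one county rarely fills in the sentence field.
records = (
    [{"county": "A", "sentence_months": 12}] * 4
    + [{"county": "B", "sentence_months": None}] * 3
    + [{"county": "B", "sentence_months": 24}]
)

rates = missing_rate_by_group(records, "county", "sentence_months")
print(rates)
```

If blanks cluster in one group, averages and totals computed from the non-blank records may not describe that group at all.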
Think like a statistician
Do integrity checks from your desk
a/k/a: How George Will became
the darling of statistics teachers
"In 1992-93, none of the five states with the highest teachers'
salaries were among the 15 states with the highest SAT scores.
And the 10 states with the lowest per pupil spending included
four . . . among the 10 states with the highest SAT scores."
--George Will, 1993
Statistical checks: From the simple to
the sophisticated
Do integrity checks from your desk
R-squared = .82
ss2 = 43 + 0.95(ss1)
Descriptive statistics:
Frequency
Average
Mode
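As an illustration of the range from simple to sophisticated, the sketch below computes a frequency table, an average and a mode, then fits a one-variable least-squares line and reports R-squared. All values are invented, and `fit_line` is a hypothetical helper; the slide does not define ss1 and ss2, so the example uses generic x and y lists.

```python
from collections import Counter
from statistics import mean, mode

def fit_line(xs, ys):
    """Ordinary least squares with one predictor.
    Returns (intercept, slope, r_squared)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

# Simple checks: a frequency table exposes odd values such as 0 and 999,
# which also drag the average far from the typical value.
ages = [34, 29, 34, 41, 0, 34, 29, 999]
frequency = Counter(ages)
print(frequency.most_common(3))
print("average:", mean(ages), "mode:", mode(ages))

# More sophisticated: fit one measure against another (hypothetical values).
# Records far from the fitted line are candidates for a closer look.
first = [10, 20, 30, 40, 50]
second = [52, 63, 70, 82, 91]
intercept, slope, r_squared = fit_line(first, second)
print(round(intercept, 2), round(slope, 2), round(r_squared, 2))
```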
Beware the documentation
Do integrity checks with other sources
Yes, that’s Harold Spaeth’s view and
mostly I think he’s right, though I’d
substitute the word more “efficient”
for more “accurate.”
--Lee Epstein
(Find a power user, and compare notes.)
What’s
missing?
An estimated 30
percent of felony
convictions are
missing from the
Minnesota public
convictions file.
(ask the keepers
of the data)
Do integrity checks with other sources
Check those codes
Do integrity checks with other sources
(a/k/a: The codes are not what they seem)
Data spanned six years. Sometime
in those six years, the violation
codes changed. No one in the
Housing Violations Bureau knew
when the switch was made, and
no one had definitions for the
previous codes.
(a/k/a: Why to pull some paper records)
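A sketch of one desk check that can hint at a coding change, assuming each record carries a year and a code: list the first and last year in which every code appears. The codes and years below are invented. Codes that all stop in one year while others all start the next suggest where to begin asking questions and which paper records to pull.

```python
def code_year_ranges(records):
    """Return {code: (first_year_seen, last_year_seen)}."""
    ranges = {}
    for year, code in records:
        first, last = ranges.get(code, (year, year))
        ranges[code] = (min(first, year), max(last, year))
    return ranges

# Hypothetical (year, violation_code) pairs spanning six years.
records = [
    (2009, "A1"), (2010, "A1"), (2011, "A1"),
    (2012, "301"), (2013, "301"), (2014, "301"),
    (2012, "302"),
]

ranges = code_year_ranges(records)
for code, (first, last) in sorted(ranges.items()):
    print(code, first, last)
```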
Beware
elements
of change
Do integrity checks with other sources
The “feename” – name
of the property owner –
in the Saint Paul Housing
Bureau’s code violations
database is pulled in
from property tax rolls. It
shows the current
owner. That person may
not have owned the
property at the time of
the violation.
(a/k/a: Why to pull
some paper records)
Summarize cases by institutions,
then spot check results.
Do integrity checks with other sources
Is it true only 6 percent of hospital emergency cases
are transferred from other hospitals?
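The summarize-then-spot-check step might look like the sketch below. The hospital names and counts are invented, and the idea that a hospital showing zero transfers may simply not be filling in the field is an assumption to test by calling the hospital, not a finding.

```python
from collections import defaultdict

def transfer_share_by_hospital(cases):
    """Return each hospital's share of emergency cases marked as transfers."""
    totals = defaultdict(int)
    transfers = defaultdict(int)
    for hospital, was_transfer in cases:
        totals[hospital] += 1
        if was_transfer:
            transfers[hospital] += 1
    return {h: transfers[h] / totals[h] for h in totals}

# Hypothetical cases: (hospital, was_transfer).
cases = (
    [("General", True)] * 3
    + [("General", False)] * 7
    + [("Mercy", False)] * 10
)

by_hospital = transfer_share_by_hospital(cases)
overall = sum(1 for _, was_transfer in cases if was_transfer) / len(cases)

# Institutions that never report a transfer are the first to spot check.
never_report = sorted(h for h, share in by_hospital.items() if share == 0)

print("overall:", overall)
print(by_hospital)
print("spot check:", never_report)
```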
Beware nulls!
Technology bites
Null scariness from the FDA’s MAUDE database
http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm
Beware nulls!
Technology bites
We want to explore reports involving Promus heart
stents, but NOT the Promus Element devices.
First, let’s see what’s in there for Promus.
Beware nulls!
Technology bites
There are 50 records that mention Promus.
We can see by scrolling that four are the
Promus Element that we wish to exclude.
Beware nulls!
Technology bites
Let’s get rid of those Elements.
Beware nulls!
Technology bites
50 – 4 = …..????
Beware nulls!
Technology bites
You’re supposed to have 46 records, but you
got 30. What are the missing 16 records?
Beware nulls!
Technology bites
Right:
Wrong:
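The null problem can be reproduced in a small SQLite table. This is a sketch under assumptions: the table and column names are hypothetical and are not the MAUDE schema, and the four rows are invented. It shows the general SQL behavior in which a NOT LIKE test on a null field is neither true nor false, so the row silently drops out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER, brand TEXT, model TEXT)")
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [
        (1, "Promus", "Element"),  # the device we want to exclude
        (2, "Promus", "Stent"),
        (3, "Promus", None),       # model field left empty
        (4, "Promus", None),
    ],
)

# Wrong: rows with a null model fail the NOT LIKE test and disappear.
wrong = conn.execute(
    "SELECT COUNT(*) FROM reports "
    "WHERE brand LIKE '%Promus%' AND model NOT LIKE '%Element%'"
).fetchone()[0]

# Right: keep the null rows explicitly.
right = conn.execute(
    "SELECT COUNT(*) FROM reports "
    "WHERE brand LIKE '%Promus%' "
    "AND (model NOT LIKE '%Element%' OR model IS NULL)"
).fetchone()[0]

print("wrong:", wrong, "right:", right)
```

The quick test is the arithmetic from the slides: total records minus excluded records should equal what the query returns.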
Beware false joins in
“encrypted” data.
Technology bites
Medicare 5 percent
sample: Doctors’ IDs
were encrypted in
some files, not in
others.
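Before trusting a join on an ID field, one simple guard is to measure how many keys in one file find a match in the other. The sketch below uses invented IDs and a hypothetical `join_coverage` helper. A very low match rate between files that should describe the same doctors is a sign the IDs may be encoded differently.

```python
def join_coverage(left_keys, right_keys):
    """Return the share of left-hand keys that appear among the right-hand keys."""
    left = list(left_keys)
    right = set(right_keys)
    if not left:
        return 0.0
    matched = sum(1 for key in left if key in right)
    return matched / len(left)

# Hypothetical doctor IDs: plain in the claims file, mostly scrambled in the other.
claims_ids = ["A1", "B2", "C3", "D4"]
provider_ids = ["x9f", "B2", "q7z", "m2k"]

coverage = join_coverage(claims_ids, provider_ids)
print("share of claims with a matching provider:", coverage)
```

A high match rate does not prove the join is right either, so it still helps to check a few matched pairs against names or other fields.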
Don’t alter
original data.
As you report and just before you publish
Make a copy of the
original data file. Put it
somewhere and don’t
touch it again.
Don’t edit an original
column or field. Make a
copy and edit that.
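A minimal sketch of this habit, assuming the data arrives as a file: fingerprint the original, work only on a copy, and confirm afterward that the original is unchanged. The file name and contents are invented, and a temporary folder stands in for a real project directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return a hex fingerprint of the file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def make_working_copy(original, workdir):
    """Copy the original into workdir under a new name and return the new path."""
    original = Path(original)
    copy = Path(workdir) / ("working_" + original.name)
    shutil.copy2(original, copy)
    return copy

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "violations.csv"
    original.write_text("id,code\n1,V12\n")

    fingerprint = sha256_of(original)          # record this in your notes
    working = make_working_copy(original, tmp)

    # All edits happen on the copy.
    working.write_text(working.read_text() + "2,V30\n")

    original_unchanged = sha256_of(original) == fingerprint

print("original untouched:", original_unchanged)
```

The same idea applies inside a table: add a new column for cleaned values rather than overwriting the field you received.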
Document
as you go
As you report and just before you publish
Keep track of all of
your queries so you
can retrace your
steps or find where
you went wrong.
As you integrity
check your data,
annotate the
queries to
remember what
you learned.
Cross check
As you report and just before you publish
If you summed data in SQL, can you reproduce the results
in a pivot table?
If you’re summing, do a list. Make sure there’s nothing
wacky in that list that would cause your count to be
wrong; e.g., duplicates.
If you have various data sources that should yield the
same conclusions, do they?
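The cross check can be sketched by computing the same totals two independent ways and comparing them. Here a SQL GROUP BY is checked against a plain Python tally, which stands in for the pivot table. The table name and figures are invented.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (region, amount) rows.
rows = [("North", 120), ("North", 80), ("South", 50), ("South", 50), ("East", 10)]

# Method 1: sum in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fines (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO fines VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM fines GROUP BY region")
)

# Method 2: tally by hand, the way a pivot table would.
tally = defaultdict(int)
for region, amount in rows:
    tally[region] += amount
manual_totals = dict(tally)

totals_agree = sql_totals == manual_totals
print(sql_totals)
print("methods agree:", totals_agree)

# Listing the underlying rows shows whether anything odd, such as an exact
# duplicate, is feeding the sum.
for region, amount in sorted(rows):
    print(region, amount)
```

Agreement between the two methods shows the arithmetic is consistent. It does not show the rows themselves are right, which is why the list matters.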
Beware the single case
As you report and just before you publish
Never report on one data record without pulling the
paper report or talking to the person in question.
What if it was a data entry error?
What if there are circumstances you don’t understand?
Recreate the wheel
As you report and just before you publish
For every fact, number or
finding in your story,
write an original query or
formula to support it.
Go back to your original
data.
Try to arrive at the same
conclusion in different
ways.
Fear is your friend
