Data validation in the Digital Age

“OK, but where did that data come from?”

Data validation in the
Digital Age

Tom Johnson Cheryl Phillips
Managing Director Data Enterprise Editor
Inst. for Analytic Journalism Seattle Times
Santa Fe, New Mexico USA Seattle, Washington USA
tom@jtjohnson.com
cphillips@seattletImes.com
1

Digital Age
Presentation by Cheryl Phillips and Tom Johnson at
National Institute of Computer-Assisted Reporting Conference
Date/Time: Friday, Feb. 24 at 11 a.m.
Location: Frisco/Burlington Room
St. Louis, Missouri USA

This PowerPoint deck and Tipsheets posted at:

http:// s d r v . m s / w N t i M 7

2

The methodology / = the value of the data set and your story

1
Important point

A data base (or
report) is only as
good as the
methodology used
to create it.
3

2
Data sets are living things; they have pedigree and genealogy

Important points
•Most [all?] data sets are living
things.
•And they have a pedigree, a
genealogy.
•Data sets live in a dynamic
environment.
•Understand the DB ecology

4

How bad data can do you wrong
Illinois and Missouri sex-offender DB
•“St. Louis Post-Dispatch - 2 May 1999: A11 – “ABOUT 700 SEX
OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES
LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER
MAKE THE LIST” By Reese Dunklin; Data Analysis By David Heath and Julie
Luca
•Sun, 3 Oct 2004 - THE DALLAS MORNING NEWS - PAGE-1A
“Criminal checks deficient; State's database of convictions is
hurt by lack of reporting, putting public safety at risk, law
officials say” By Diane Jennings and Darlean Spangenberger
•See stories here

How bad data can do you wrong
2011 - New Mexico Sec. of State’s “questionable
voters” data set – “The Big Bundle”
•~1.1m voters
•Previous SoS didn’t clean rolls
•Matched name, address, DoB and SS#
– SSA data base; NM driver’s licenses
– 2 variables “mismatch” =  Questionable?
– Asked State Police (not AG’s office) to investigate

Problems with Sec. of State methodology

• What’s the error rate of original DB?
• Definition of “error”? (Gonzales or Gonzalez)
• Sample(s) by county and state total?
• Error rates of comparative DBs?
• Aggregation of error problem
• 2011 Help America Vote Verification Transaction
Totals, Year-to-Date, by State
https://www.socialsecurity.gov/open/havv/havv-year-

Source: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html

There be dragons!

A most
Data base
wonderful
rich with story!!!
potential

9

Building genealogy for target DB

1. Pre-plan 1. Acquire latest data and
•2nd monitor related docs
•“Logbook” apps 1. Do tables conform to
1. Lit. review/ interview peers record layout?
1. Do data fit theoretical 1. Do docs specify expected
models? ranges & frequencies?
1. Do a “critical biography” of 1. Are data values missing or
the data out of range?
1. Does biography raise 1. Review major checklist
critical warnings?
1. Have others run analysis of
this data?
Source: Palmer, Griff. “Flowchart/decision tree for data base analysis.” pgs. 136-146. Ver 1.0 Proceedings, IAJ Press (Santa Fe,
NM), April 2006. http://www.lulu.com/product/paperback/ver-10-workshop-proceedings/546459


1. Pre-plan 1. Acquire latest data and
• Changes in
•2nd monitor related docs
definitions?
•“Logbook” apps 1. Do tables conform to
• review/ interview peers
1. Lit. By administrators? record layout?
• Formal or informal?
1. Do By statute?
• data fit theoretical 1. Do docs specify expected
models? ranges & frequencies?
• Changes in collection
1.methods, data entry,
Do a “critical biography” of 1. Are data values missing or
the data out of range?
vetting, updating, file
1.type/format?raise
Does biography 1. Review major checklist
critical warnings?
• Changes in users and
1.usage
Have others run analysis of
this data?
• Data cleaning

Data Quality checkpoints

• Constancy of definitions and coding categories?
• All at same time and location?
• Completeness: How many records have unfilled
cells? Are the tendencies of “nulls” consistent in
all records, variable types?
• Precision: Are the numbers rounded or?
• Hope for fine-grained, not summaries or aggregates
• Can be especially important with temporal and
geographic data, i.e. What is the range(s) of the time
scales?

Cheryl on Quant methods for
measuring data quality

Newsroom methods for
measuring data quality

• Test frequencies on key fields
Bicycle accidents in Seattle included a time field. But
it was almost always noon when accidents occurred.
Caveat: Don’t over-reach with your conclusions or
analysis

Don’t over-reach with your
analysis

– Rates are good – IF you have the data to calculate
them.

Outliers are important
Explore the reasons behind anomalies or unexpected
trends in the data.
From the state of WA: After
going back and forth with our
analyst on this, we decided it
would be easiest for her to
just pull the data. You would
have been able to get most of
the way there through that
fiscal.wa.gov site, but there
was some stimulus money
you wouldn’t have captured
and we included the changes
so far to the current
biennium (based on the
supplemental the legislature
approved in December).

Other Key Data Checks

– When you update
the data, make sure
nothing has changed.
Check definitions for
expansion or
reduction and talk to
the creator of the
data.
– Be ready to nix a
story.


– Do the math: run sums, percent change, other
calculations. Test that math against the results in
the database – do they match?
– Look for unexpected nulls
– Run a group by query and sort alphabetically by
major fields to test for misspellings or other
categorization errors.
– If your data should include every city, or every
county in the state, does it? Are you missing data?


– Check with experts and have them test your
analysis. Research the methodology used with the
kind of data you are working with.
– There is version control for Web frameworks – use
some kind of version control for your database,
even if it’s in an Excel spreadsheet. Any time you
change it, log what you did and when and why.

– Test the data against source documents.

• How we did it

• Pre-plan • Acquire latest data and
2nd monitor related docs

NOW you are ready to
“Logbook” apps
• Do tables conform to record
• Lit. review/ interview peers layout?

write a story•Do docs&specifyon
• Do data fit theoretical
models?
based expected
ranges frequencies?
a data base!values missing or
• Do a “critical biography” of
the data
• Are data
out of range?
• Does biography raise critical • Review major checklist
warnings?
• Have others run analysis of Analysis
this data?

Summing Up

• Databases are constantly dynamic, “living” things.
Look for and measure their energy and change.
• Beware of rounding error
– Always try to get the most fine-grained data possible in its
ORIGINAL data form or application, i.e. avoid PDFs with
SUMMARY data
• Beware of changing definitions
• Beware of changing data collectors, data entry
personnel, changing norms of editing and usage.

“OK, but where did that data come from?”

Many Thanks
This PowerPoint deck and Tipsheets posted at:

http:// s d r v . m s / w N t i M 7
Tom Johnson Cheryl Phillips
Managing Director Data Enterprise Editor
Inst. for Analytic Journalism Seattle Times
Santa Fe, New Mexico USA Seattle, Washington USA
tom@jtjohnson.com
cphillips@seattletImes.com
25

Data validation in the Digital Age

More Related Content

Similar to Data validation in the Digital Age

More from J T "Tom" Johnson

Recently uploaded

Data validation in the Digital Age

Editor's Notes