Data validation in the Digital Age

The data sets you are about to analyze are only as good and valid as the methodology used to gather the data and create the data set. The presentation by Tom Johnson and Cheryl Phillips was made at the 2012 meeting of the National Institute for Computer-Assisted Reporting, Feb. 2012, in St. Louis.

Published in: News & Politics, Technology

  1. “OK, but where did that data come from?” Data validation in the Digital Age
     Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA, tom@jtjohnson.com
     Cheryl Phillips, Data Enterprise Editor, Seattle Times, Seattle, Washington USA, cphillips@seattletImes.com
  2. Data validation in the Digital Age
     Presentation by Cheryl Phillips and Tom Johnson at National Institute of Computer-Assisted Reporting Conference
     Date/Time: Friday, Feb. 24 at 11 a.m.
     Location: Frisco/Burlington Room, St. Louis, Missouri USA
     This PowerPoint deck and Tipsheets posted at: http://sdrv.ms/wNtiM7
  3. The methodology / = the value of the data set and your story
     Important point (1): A data base (or report) is only as good as the methodology used to create it.
  4. Data sets are living things; they have pedigree and genealogy
     Important points (2):
       • Most [all?] data sets are living things.
       • And they have a pedigree, a genealogy.
       • Data sets live in a dynamic environment.
       • Understand the DB ecology
  5. How bad data can do you wrong: Illinois and Missouri sex-offender DB
       • St. Louis Post-Dispatch, 2 May 1999, A11: “ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST” By Reese Dunklin; Data Analysis By David Heath and Julie Luca
       • The Dallas Morning News, Sun, 3 Oct 2004, Page 1A: “Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say” By Diane Jennings and Darlean Spangenberger
       • See stories here
  6. How bad data can do you wrong: 2011 New Mexico Sec. of State’s “questionable voters” data set, “The Big Bundle”
       • ~1.1m voters
       • Previous SoS didn’t clean rolls
       • Matched name, address, DoB and SS#
           – SSA data base; NM driver’s licenses
           – 2 variables “mismatch” = Questionable?
           – Asked State Police (not AG’s office) to investigate
  7. Problems with Sec. of State methodology
       • What’s the error rate of original DB?
       • Definition of “error”? (Gonzales or Gonzalez)
       • Sample(s) by county and state total?
       • Error rates of comparative DBs?
       • Aggregation of error problem
       • 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State https://www.socialsecurity.gov/open/havv/havv-year-
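One way to see the “aggregation of error” problem is a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each data base involved in a match has its own error rate and that errors are independent across data bases. The 1% rates are hypothetical placeholders, not measured figures for any of the data bases named above; only the ~1.1m voter count comes from the slides.

```python
def combined_error_rate(rates):
    """Chance that a record is wrong in at least one source.

    Assumes errors in each source are independent of the others,
    which is a simplifying assumption, not a finding.
    """
    p_all_clean = 1.0
    for r in rates:
        p_all_clean *= (1.0 - r)
    return 1.0 - p_all_clean


# Hypothetical error rates for three sources (e.g. voter roll,
# SSA file, driver's license file). Replace with measured rates.
HYPOTHETICAL_RATES = [0.01, 0.01, 0.01]
VOTERS = 1_100_000  # "~1.1m voters" from the slide

rate = combined_error_rate(HYPOTHETICAL_RATES)
print(f"Combined error rate: {rate:.2%}")
print(f"Records that could mismatch through data error alone: {rate * VOTERS:,.0f}")
```

Even small per-source error rates add up once several data bases are matched against each other, which is why the error rate of every comparative DB matters, not just the original.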
  8. Source: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
  9. There be dragons! Data base rich with potential. A most wonderful story!!!
  10. Building genealogy for target DB
       • Pre-plan
           – 2nd monitor
           – “Logbook” apps
       • Lit. review/ interview peers
       • Do data fit theoretical models?
       • Do a “critical biography” of the data
       • Does biography raise critical warnings?
       • Have others run analysis of this data?
       • Acquire latest data and related docs
       • Do tables conform to record layout?
       • Do docs specify expected ranges & frequencies?
       • Are data values missing or out of range?
       • Review major checklist
     Source: Palmer, Griff. “Flowchart/decision tree for data base analysis.” pgs. 136-146. Ver 1.0 Proceedings, IAJ Press (Santa Fe, NM), April 2006. http://www.lulu.com/product/paperback/ver-10-workshop-proceedings/546459
  11. Building genealogy for target DB
       • Pre-plan
           – 2nd monitor
           – “Logbook” apps
       • Lit. review/ interview peers
       • Do data fit theoretical models?
       • Do a “critical biography” of the data
       • Does biography raise critical warnings?
       • Have others run analysis of this data?
       • Acquire latest data and related docs
       • Do tables conform to record layout?
       • Do docs specify expected ranges & frequencies?
       • Are data values missing or out of range?
       • Review major checklist
     Callout:
       • Changes in definitions?
       • By administrators?
       • By statute?
       • Formal or informal?
       • Changes in collection methods, data entry, vetting, updating, file type/format?
       • Changes in users and usage
       • Data cleaning
  12. Data Quality checkpoints
       • Constancy of definitions and coding categories?
           – All at same time and location?
       • Completeness: How many records have unfilled cells? Are the tendencies of “nulls” consistent in all records, variable types?
       • Precision: Are the numbers rounded or?
           – Hope for fine-grained, not summaries or aggregates
           – Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales?
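The completeness checkpoint can be tested with a few lines of code. This is a minimal sketch using only Python's standard library; the column names and sample rows are made up for illustration, and `count_empty_cells` is a hypothetical helper name, not something from the presentation.

```python
import csv
import io
from collections import Counter


def count_empty_cells(rows, fieldnames):
    """Return {field: number of rows where that field is blank or missing}."""
    empties = Counter({name: 0 for name in fieldnames})
    for row in rows:
        for name in fieldnames:
            value = row.get(name)
            if value is None or value.strip() == "":
                empties[name] += 1
    return dict(empties)


# Tiny made-up sample standing in for a real file.
SAMPLE = """name,county,dob
Ana Gonzales,Santa Fe,1970-01-02
Bob Smith,,1965-05-06
,Taos,
"""

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
report = count_empty_cells(rows, reader.fieldnames)
for field, n in report.items():
    print(f"{field}: {n} of {len(rows)} rows empty")
```

Comparing these counts across fields, or across subsets such as counties or years, is one way to see whether the pattern of “nulls” is consistent.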
  13. Cheryl on Quant methods for measuring data quality
  14. Data Quality checkpoints
       • Constancy of definitions and coding categories?
           – All at same time and location?
       • Completeness: How many records have unfilled cells? Are the tendencies of “nulls” consistent in all records, variable types?
       • Precision: Are the numbers rounded or?
           – Hope for fine-grained, not summaries or aggregates
           – Can be especially important with temporal and geographic data, i.e. What is the range(s) of the time scales?
  15. Newsroom methods for measuring data quality
       • Test frequencies on key fields
         Bicycle accidents in Seattle included a time field. But it was almost always noon when accidents occurred.
     Caveat: Don’t over-reach with your conclusions or analysis
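A frequency test like the one that exposed the noon problem can be sketched as follows. The sample times are invented for illustration, and the 50% threshold for flagging a value is an arbitrary assumption; pick whatever cutoff makes sense for the field.

```python
from collections import Counter


def field_frequencies(values):
    """Return (value, count, share) tuples, most common value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, n, n / total) for value, n in counts.most_common()]


# Made-up times standing in for an accident file's time field.
accident_times = ["12:00", "12:00", "08:15", "12:00", "17:40", "12:00"]

for value, n, share in field_frequencies(accident_times):
    # Flag any single value that accounts for more than half the records.
    flag = "  <-- possible default value?" if share > 0.5 else ""
    print(f"{value}: {n} ({share:.0%}){flag}")
```

A value that dominates a field it should not dominate is a prompt to ask the data's keeper how that field is filled in, not proof of anything by itself.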
  16. Don’t over-reach with your analysis
       – Rates are good, IF you have the data to calculate them.
  17. Outliers are important
     Explore the reasons behind anomalies or unexpected trends in the data.
     From the state of WA: After going back and forth with our analyst on this, we decided it would be easiest for her to just pull the data. You would have been able to get most of the way there through that fiscal.wa.gov site, but there was some stimulus money you wouldn’t have captured and we included the changes so far to the current biennium (based on the supplemental the legislature approved in December).
  18. Other Key Data Checks
       – When you update the data, make sure nothing has changed. Check definitions for expansion or reduction and talk to the creator of the data.
       – Be ready to nix a story.
  19. Other Key Data Checks
       – Do the math: run sums, percent change, other calculations. Test that math against the results in the database – do they match?
       – Look for unexpected nulls
       – Run a group by query and sort alphabetically by major fields to test for misspellings or other categorization errors.
       – If your data should include every city, or every county in the state, does it? Are you missing data?
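Three of the checks above (do the math, group and sort alphabetically, look for missing places) can be sketched in plain Python. The city names and numbers are made up for illustration, and the helper names are hypothetical; in practice the same checks are often run as SQL queries or spreadsheet pivots.

```python
from collections import Counter


def group_counts_sorted(values):
    """Group-by count, sorted alphabetically so near-duplicates sit side by side."""
    return sorted(Counter(values).items())


def missing_members(expected, found):
    """Members of the expected list that never appear in the data."""
    return sorted(set(expected) - set(found))


def totals_match(parts, reported_total, tolerance=0):
    """True if the parts add up to the reported total, within a tolerance."""
    return abs(sum(parts) - reported_total) <= tolerance


# Made-up records: note the misspelled "Seatle".
cities = ["Seattle", "Tacoma", "Seatle", "Spokane", "Seattle"]
expected_cities = ["Seattle", "Spokane", "Tacoma", "Yakima"]

print(group_counts_sorted(cities))
print(missing_members(expected_cities, cities))
print(totals_match([10, 20, 30], 60))
```

Sorting alphabetically puts “Seatle” right next to “Seattle”, which makes the misspelling easy to spot by eye.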
  20. Other Key Data Checks
       – Check with experts and have them test your analysis. Research the methodology used with the kind of data you are working with.
       – There is version control for Web frameworks – use some kind of version control for your database, even if it’s in an Excel spreadsheet. Any time you change it, log what you did and when and why.
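A change log does not need special software. The sketch below appends “what, when and why” entries to a CSV file; the file name, field names, and `log_change` helper are all assumptions made for illustration, and a shared spreadsheet or a real version control system would serve the same purpose.

```python
import csv
import os
import tempfile
from datetime import datetime, timezone

LOG_FIELDS = ["when", "who", "what", "why"]


def log_change(log_path, who, what, why):
    """Append one entry to a CSV logbook, writing a header for a new file."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "who": who,
        "what": what,
        "why": why,
    }
    try:
        # Mode "x" fails if the file exists, so the header is written only once.
        with open(log_path, "x", newline="", encoding="utf-8") as f:
            csv.DictWriter(f, LOG_FIELDS).writeheader()
    except FileExistsError:
        pass
    with open(log_path, "a", newline="", encoding="utf-8") as f:
        csv.DictWriter(f, LOG_FIELDS).writerow(entry)
    return entry


# Demo in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "logbook.csv")
    log_change(path, "reporter_a", "Removed 3 duplicate rows", "Same record ID appeared twice")
    log_change(path, "reporter_b", "Recoded blank county as UNKNOWN", "Needed for group-by counts")
    with open(path, newline="", encoding="utf-8") as f:
        entries = list(csv.DictReader(f))
    print(len(entries), "entries logged")
```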
  21. Other Key Data Checks
       – Test the data against source documents.
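When the data set is too large to check every record against its source document, one common approach (an assumption here, not something the slides prescribe) is to pull a random sample of records and verify those by hand. The helper name and sample size below are hypothetical.

```python
import random


def sample_for_spot_check(record_ids, k, seed=None):
    """Pick up to k record IDs at random to compare against source documents.

    Passing a seed makes the sample reproducible, so the same records
    can be re-checked later or by a colleague.
    """
    ids = list(record_ids)
    rng = random.Random(seed)
    return sorted(rng.sample(ids, min(k, len(ids))))


# Made-up record IDs 1 through 500; check 10 of them by hand.
to_check = sample_for_spot_check(range(1, 501), 10, seed=2012)
print(to_check)
```

Recording the seed and the list of checked records in the logbook keeps the verification itself documented.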
  22. Other Key Data Checks
       • How we did it
  23. Building genealogy for target DB
       • Pre-plan
           – 2nd monitor
           – “Logbook” apps
       • Lit. review/ interview peers
       • Do data fit theoretical models?
       • Do a “critical biography” of the data
       • Does biography raise critical warnings?
       • Have others run analysis of this data?
       • Acquire latest data and related docs
       • Do tables conform to record layout?
       • Do docs specify expected ranges & frequencies?
       • Are data values missing or out of range?
       • Review major checklist
     Analysis
     NOW you are ready to write a story based on a data base!
  24. Summing Up
       • Databases are constantly dynamic, “living” things. Look for and measure their energy and change.
       • Beware of rounding error
           – Always try to get the most fine-grained data possible in its ORIGINAL data form or application, i.e. avoid PDFs with SUMMARY data
       • Beware of changing definitions
       • Beware of changing data collectors, data entry personnel, changing norms of editing and usage.
  25. “OK, but where did that data come from?” Data validation in the Digital Age
     Many Thanks
     This PowerPoint deck and Tipsheets posted at: http://sdrv.ms/wNtiM7
     Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico USA, tom@jtjohnson.com
     Cheryl Phillips, Data Enterprise Editor, Seattle Times, Seattle, Washington USA, cphillips@seattletImes.com