Tom johnson datavalidity-eng-nov21-arbol

Lecture presented at Catedra Walter Lippmann, Universidad del Rosario, Bogota, Colombia, 23 Nov. 2012
See: http://issuu.com/consejo_de_redaccion/docs/ur_-_semana_-_seminario_walter_lippmann_2012_2


  1. Árbol de vida de los datos ("The data's tree of life"): Data validation in the Digital Age
     Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico, USA
     tom@jtjohnson.com, @jtjohnson
  2. Data validation in the Digital Age
     Presentation by Tom Johnson at Cátedra Walter Lippmann de Periodismo y Opinión Pública, Claustro de la Universidad, Universidad del Rosario, Bogota, Colombia
     Date/Time: 22 November 2012
     This PowerPoint deck and tipsheets are posted at: http://sdrv.ms/wNtiM7
  3. Important point 1: You know more than I do
     Each of you knows more about some aspect of ensuring data quality than I do.
  4. Data set and story: The STORY!
  5. Data set and collection process
     [Diagram over a background of binary digits: Collection Process, Data Set, The STORY!]
  6. Data set and validation process
     [Diagram over a background of binary digits: Collection Process, Data Set, The STORY!, Validation Process]
  7. Paying the price of bad data: Illinois and Missouri sex-offender DB
     • St. Louis Post-Dispatch, 2 May 1999, A11: "ABOUT 700 SEX OFFENDERS DO NOT APPEAR TO LIVE AT THE ADDRESSES LISTED ON A ST. LOUIS REGISTRY; MANY SEX OFFENDERS NEVER MAKE THE LIST". By Reese Dunklin; data analysis by David Heath and Julie Luca
     • The Dallas Morning News, Sun., 3 Oct 2004, page 1A: "Criminal checks deficient; State's database of convictions is hurt by lack of reporting, putting public safety at risk, law officials say". By Diane Jennings and Darlean Spangenberger
  8. How bad data can do you wrong
     2011: New Mexico Sec. of State's "questionable voters" data set, "The Big Bundle"
     • ~1.1m voters
     • Previous Sec. of State didn't clean rolls
     • Matched name, address, DoB and SS#
       • SSA data base; NM driver's licenses
       • 2 variables "mismatch" = questionable?
       • Asked State Police (not AG's office) to investigate
  9. Problems with Sec. of State methodology
     • What is the error rate of the original DB?
       • Definition of "error"? (Gonzales or Gonzalez)
       • Sample(s) by county and state total?
       • Error rates of comparative DBs?
       • Aggregation-of-error problem
     • 2011 Help America Vote Verification Transaction Totals, Year-to-Date, by State: https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html
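
The two questions on this slide, what counts as an "error" and how errors aggregate when databases are compared, can be sketched in a few lines of Python. This is only an illustration, not the Secretary of State's method: the function names and the error rates (2%, 3%, 5%) are hypothetical, and the aggregation formula assumes errors in each database are independent.

```python
from difflib import SequenceMatcher
from math import prod


def name_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def chance_correct_in_all(error_rates):
    """Probability that a record is correct in every database.

    Assumes (hypothetically) that errors in each database are independent.
    """
    return prod(1.0 - e for e in error_rates)


# An exact-match test calls these two spellings a "mismatch",
# although they differ by a single letter.
similarity = name_similarity("Gonzales", "Gonzalez")
print("exact match:", "Gonzales" == "Gonzalez")
print("similarity:", round(similarity, 3))

# Hypothetical error rates for three databases being compared.
all_correct = chance_correct_in_all([0.02, 0.03, 0.05])
print("chance a record is right everywhere:", round(all_correct, 4))
```

Even with small per-database error rates, the share of records that are correct in every source drops quickly as more sources are matched, which is why the error rate of each comparison database matters.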
  10. 10. DataSetCollectionProcess 0100111010100101Collection 0010001010101001 The 0010100101001010Process 1010010010100010 1010100100111010 Data 1001010010001010 1010010010100101 STORY! 0010101010010010 1000101010100100 Set 1110101001010010 0010101010010010 1001010010101010 0100101000101110 1010010010101010 1001001010011101 0100101001000101 0101001001010010 1001010101001001 0100010101010110 1101010010100101 10
  11. 11. Data sets are living things; they have pedigree and genealogy Important point •Most [all?] data sets are living things. •And they have a pedigree, a genealogy, an “árbol de vida”. •Data sets live in a dynamic environment. •Understand the DB ecology 11
  12. 12. Data sets are living things; they have pedigree and genealogy Important point • NEVER work with your original data set; always a copy of the file(s) • More combined data sets = greater chance of error • Larger data sets = greater chance of error 12
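
One way to follow the "never work with the original" rule is to make the working copy in code and confirm it is byte-for-byte identical before touching it. The sketch below uses only the Python standard library; the function names and folder layout are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_working_copy(original: Path, workdir: Path) -> Path:
    """Copy the original into a separate working folder and verify the copy."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy_path = workdir / original.name
    shutil.copy2(original, copy_path)  # also preserves file timestamps
    if sha256_of(original) != sha256_of(copy_path):
        raise IOError("working copy differs from the original")
    return copy_path
```

Recording the checksum of the original in your logbook also lets you show later that the source file was never altered.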
  13. Types of Data
      [Diagram over a background of binary digits: Data Set]
  14. Data Quality = function of…
      • Objectives, reputation of the database creator
      • Validity and precision of the collection/creation process, and of the resulting data
      • Statistical data?
        • Primary data (collected, managed by agency or individual)
        • Secondary data (agency or individual is using someone else's "primary" data)
  15. Pyramid of significance
      • How to judge whether some data, and its potential stories, are more trustworthy than others?
        • Go back to librarians' hierarchy of trusted sources when searching? (Has anyone tested the "quality" of data sets from those strata of sources? If not, a good research project.)
  16. Learn from Librarians
      • Evaluating Web Pages: Techniques to Apply & Questions to Ask: http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html
        • What can the URL tell you?
        • Gov't agency? Scholarly? Interest group? Individual?
        • Has a reputation for accuracy been created over time?
  17. Learn from Librarians
      • Does it all add up?
        • Why was the page put on the web?
          • Inform, give facts, give data?
          • Explain, persuade?
          • Sell, entice?
          • Share?
          • Disclose?
      • Is the information current? When was it last updated and by whom?
      • If the data is available on other sites, who/what was the original creator and editor of the data?
  18. Hierarchy of Trust
      • ".org" is organization. Sites that end in .org are usually non-profit organizations.
        • Can be very good sources or very poor sources; take care to research their possible agendas or political biases.
      • For .gov, .edu, or .mil, the information has probably been vetted before it was posted.
        • Websites with .edu, .gov, and .mil have to be applied for, and their use is controlled.
        • It doesn't mean they are fool-proof, though.
      • ".net" means network.
      • ".info" is the Internet's first unrestricted top-level domain since .COM. There are no restrictions on who may register .INFO names. .INFO was created for general use around the world.
      Source: http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299
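
As a first, very rough pass at "What can the URL tell you?", the top-level domain can be pulled out of a URL automatically. This sketch uses Python's standard `urllib.parse`; the `DOMAIN_NOTES` table simply paraphrases the slide above, the function names are hypothetical, and the result is a starting clue, not a verdict on the data.

```python
from urllib.parse import urlparse

# Notes paraphrased from the "Hierarchy of Trust" slide.
DOMAIN_NOTES = {
    "gov": "controlled registration; probably vetted, not fool-proof",
    "edu": "controlled registration; probably vetted, not fool-proof",
    "mil": "controlled registration; probably vetted, not fool-proof",
    "org": "usually non-profit; research possible agendas or biases",
    "net": "network; no particular signal",
    "info": "unrestricted registration",
}


def top_level_domain(url: str) -> str:
    """Return the last label of the host name, e.g. 'edu' or 'gov'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def first_clue(url: str) -> str:
    """Look up the slide's note for a URL's top-level domain."""
    tld = top_level_domain(url)
    return DOMAIN_NOTES.get(tld, "no note for this domain; look closer")


for url in (
    "http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Evaluate.html",
    "https://www.socialsecurity.gov/open/havv/havv-year-to-date-2011.html",
    "http://www.morriscs.org/webpages/jwaffle/index.cfm?subpage=1317299",
):
    print(top_level_domain(url), "->", first_clue(url))
```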
  19. Hierarchy of Trust
      • Credible websites should list contact information and resources.
        • If only cell phones and PO boxes = suspicion
      • If the author is named, find his/her web page to…
        • Verify educational credentials
        • Discover whether the writer has been published in a scholarly journal
        • Verify that the writer is employed by a research institution or university
  20. Hierarchy of Trust
      • Internet pages that have been published more recently are usually more credible.
      • Find this information at the bottom of a website, in the "About us" section, or via "View page source"
  21. Hierarchy of Trust
      • Selling something?
      • Asking you to sign up for something?
      • May not be presenting you with neutral, unbiased information.
  22. Hierarchy of Trust
      Probably reliable sites, but not necessarily reliable data
      [Pyramid: What is the site's purpose? Check the publishing date. Credible websites/authors. Check the domain of the URL.]
  23. Collection process and data set
      [Diagram over a background of binary digits: Collection Process, Data Set, The STORY!]
  24. Process of Data Evaluation
      1. Pre-planning
         • 2nd monitor
         • "Logbook" (bitácora) apps
         • Checklist of intended steps
      2. Lit. review/interview peers
         • Nothing is new; everything has a precedent
         • How have others attacked this problem?
      3. Do data fit theoretical models?
         - Depends on subject: traffic flow vs. crime, or educational level vs. income
         - Sometimes good to use non-trad. models: crime and disease
  25. Process of Data Evaluation
      4. Do a "critical biography" of the data
         - Why was data collected? Who ordered its creation (law? agency? individual?)
         - When first collected?
         - News stories about the data?
      5. Does biography raise critical warnings?
         - Have laws related to data remained the same?
         - Have definitions remained the same?
      6. Have others run analysis of this data?
         - Not only journalists, but other agencies/people
  26. Process of Data Evaluation
      7. Acquire latest data and related documentation
         - Get data schema & code sheet
         - Get instructions to data collectors and data entry clerks
  27. Process of DB evaluation
      Ask for a copy of the data-entry form & explanation sheet
      [Diagram labels: computer, data-entry sheet, codes sheet, data, data base schema]
  28. Process of Data Evaluation
      7. Acquire latest data and related documentation
         - Get data schema & code sheet
         - Get instructions to data collectors and data entry clerks
      8. Compare record layout to tables. This may tell you:
         - What data you did not receive
         - Possibly, what data is feeding into other variables or calculations
      9. Do documents specify expected ranges & frequencies?
         - Suggests variables to be found. If expected range is 1-7 and you find 8…
  29. Process of Data Evaluation
      10. Are data values missing or out of range?
          - Use Excel (or R) formula to test "expected" ranges
            - =MIN(A1:A100) or =MAX(A1:A100)
          - Use Excel's conditional formatting feature
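
The same range test can be scripted when a data set is too large to eyeball in a spreadsheet. The sketch below is a minimal Python version of the MIN/MAX check, assuming a documented range of 1-7 as in step 9; the function name and sample values are hypothetical.

```python
def check_values(values, low, high):
    """Report min, max, missing cells and out-of-range values for one column."""
    missing_rows = [i for i, v in enumerate(values) if v is None or v == ""]
    present = [(i, v) for i, v in enumerate(values) if v is not None and v != ""]
    out_of_range = [(i, v) for i, v in present if not (low <= v <= high)]
    numbers = [v for _, v in present]
    return {
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
        "missing_rows": missing_rows,
        "out_of_range": out_of_range,
    }


# Hypothetical column whose code sheet says values run from 1 to 7.
column = [3, 7, 1, 8, None, 5, ""]
result = check_values(column, low=1, high=7)
print(result)
```

Row positions are zero-based here; the value 8 at position 3 is the kind of surprise step 9 warns about, and positions 4 and 6 are empty cells.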
  30. Process of DB evaluation
      Major questions: revise your list of major checkpoints
      10. Review major checklist
          • Are there changes in definitions?
            • Changed by law?
            • By the administrators?
            • Formal or informal, by data entry process?
          • Are there changes in the collection methods, data entry, editing of data, quality checking, and the type and form of files?
          • Were there changes in the users and the use of the data?
          • Now it is time to clean the data
  31. Is perfection necessary?
      • How "clean" must the data be?
      • Depends on the goals, and scale, of the analysis
        • How important is the actual age of an individual? Or…
        • How precise should the lat/longitude data be?
      • Precision: Are the numbers rounded or not?
        • Hope for fine-grained data, not summaries or aggregates
        • Can be especially important with temporal and geographic data, i.e. what is the range(s) of the time scales?
  32. Data Quality checkpoints
      • Constancy of definitions and coding categories?
      • Completeness:
        • How many records have unfilled cells?
        • Are the tendencies of "nulls" consistent across all records and variable types?
  33. COMMON VERIFICATION METHODS
      • Counting: Do you have the number of records indicated/promised?
      • If >1,000 records, sample to test
        • To confirm your methodology
      • Proportion of completed fields
        • If a record has X fields, what % of records are complete?
        • Are there trends of null (empty) fields?
      • Draw on many Excel functions:
        • COUNTIFS or SUMIF
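
The counting and completeness checks above translate directly into a short script. The following is a sketch in standard-library Python, assuming the data arrives as CSV; the sample data, the promised record count of 5, and the function names are all hypothetical.

```python
import csv
import io
import random

# Hypothetical four-record file; empty cells stand in for missing values.
SAMPLE_CSV = """id,name,age,city
1,Ana,34,Bogota
2,Luis,,Cali
3,,51,
4,Marta,29,Medellin
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def is_blank(value):
    return value is None or value.strip() == ""


def completeness_report(rows, promised_count):
    """Count records, blanks per field, and the share of fully filled records."""
    fields = list(rows[0].keys()) if rows else []
    blanks = {f: sum(is_blank(r[f]) for r in rows) for f in fields}
    complete = sum(all(not is_blank(r[f]) for f in fields) for r in rows)
    return {
        "record_count": len(rows),
        "matches_promised": len(rows) == promised_count,
        "pct_complete_records": 100.0 * complete / len(rows) if rows else 0.0,
        "blanks_per_field": blanks,
    }


def spot_check_sample(rows, k, seed=0):
    """Draw a reproducible random sample of records to check by hand."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


rows = load_rows(SAMPLE_CSV)
report = completeness_report(rows, promised_count=5)
print(report)
print(spot_check_sample(rows, k=2))
```

A fixed seed keeps the sample reproducible, so a colleague can re-draw the same records when reviewing your checks.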
  34. Data Quality Examples
  35. Scatterplots and box plots
      [Figure: box plots]
  36. What is a scatterplot?
      • A scatterplot is often the 1st step in analysis
      • Examine the relationship between the variables; determine if there are any problems/issues with the data
      • A scatterplot indicates anything unique or interesting about the data, such as:
        • How is the data dispersed?
        • Are there outliers? A scatterplot is useful for "eyeballing" the presence of outliers.
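
Outliers can also be flagged numerically before plotting. The sketch below applies the conventional box-plot rule (values more than 1.5 times the interquartile range beyond the quartiles) using Python's standard `statistics` module; the rule is a common convention rather than something the slides prescribe, and the sample values are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside the box-plot fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Hypothetical measurements with one suspicious entry.
data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]
outliers = iqr_outliers(data)
print(outliers)
```

A flagged value is a prompt to go back to the source documents, not proof of an error: it may be a typo, or it may be the story.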
  37. Convergence of Data Quality with Data Veracity
      What is the difference?
      • Data quality is the responsibility of whoever, or whatever agency, is collecting or creating the data set
      This suggests questions journalists should ask about DQ
      Do methodologies differ?
  38. Resources
      • Free
        • Power Pivot: Excel 2010 add-on for working with large data sets
        • R: free software environment for statistical computing and graphics
        • Shiny: lets R users turn analyses into interactive web applications
        • Google Refine: tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases
        • Google Fusion Tables: an experimental data visualization web application to gather, visualize, and share larger data tables
        • Tableau Public: interact with the data, download it, or create visualizations of it
        • Junar: cloud-based platform for opening data
  39. Resources
      • Open Source
        • Flat File Checker: a simple, intuitive tool for validation of structured data in flat files (*.txt, *.csv, etc.)
        • Shiny: lets R users turn analyses into interactive web applications
      • Excel add-ons
      • Commercial Companies & Products
        • Techspeed Data Cleansing
        • SAS® Data Quality Advanced
  40. Resources: Professional disciplines and organizations
      • International Association for Information and Data Quality
      • DAMA International
      • Forensic Accounting/Performance Measurement
        • National Association of Forensic Accountants (NAFA)
        • Certified Fraud Examiner (CFE)
        • International Forensic Accounting Association
        • Forensic Accountants Society of North America
      • International City/County Management Association
  41. Forensic Accounting (Contabilidad Forense)
  42. Resources: Professional disciplines, organizations and others
      • La Contabilidad o Auditoria Forense: un conocimiento básico en Colombia
      • Contabilidad Forense: ¿El lado sexy de la Contaduría?
      • La Contabilidad Forense
      • Contabilidad Forense, una herramienta que busca la verdad
      • Aplicación del Derecho a la Contabilidad Forense: La práctica indagatoria contra el delito económico
  43. Árbol de vida de los datos (Data validation in the Digital Age)
      Tom Johnson, Managing Director, Inst. for Analytic Journalism, Santa Fe, New Mexico, USA
      tom@jtjohnson.com, @jtjohnson
