Diagnosing dirty data_ire2013


Diagnosing dirty data - IRE 2013 (including cat photos)

Published in: Technology, News & Politics

  1. Diagnosing Dirty Data
     Jaimi Dowdell, IRE/NICAR
     Jennifer LaFleur, ProPublica
  2. Get your data's history
     • Know the source of the data
     • Know how it's used
     • Know what all the fields mean
     • Know what other stories have been done with it
  3. What is dirty data?
     • Missing records
     • Incorrect information
     • Duplicate information
     • No standardization
  4. Take your data's temperature
     • How many records should you have?
     • Double-check totals or counts. Check for studies/summary reports.
     • Check for duplicates. Make sure they are real duplicates. Is it possible that there are hidden duplicates?
     • Consistency-check all fields. Are all city/county names spelled the same? Are all codes found within documentation?
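The checks on this slide can be sketched in a few lines of Python. This is only an illustration, not the presenters' method: the sample data, the field names, and the `normalize` and `take_temperature` helpers are all hypothetical, and the "hidden duplicate" test simply compares values after lowercasing and collapsing whitespace.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real file (note the trailing space in row 4).
SAMPLE = """id,name,city
1,Ann Lee,St. Louis
2,Bob Ray,Saint Louis
2,Bob Ray,Saint Louis
3,ann lee ,St. Louis
"""

def normalize(value):
    # Lowercase and collapse whitespace so near-identical values compare equal.
    return " ".join(value.lower().split())

def take_temperature(rows, expected_count, key_fields, check_field):
    # Exact duplicates: every field identical.
    exact = Counter(tuple(sorted(r.items())) for r in rows)
    # Loose matches: key fields identical after normalization.
    loose = Counter(tuple(normalize(r[f]) for f in key_fields) for r in rows)
    return {
        "count_matches": len(rows) == expected_count,
        "exact_duplicates": sum(n - 1 for n in exact.values() if n > 1),
        # Keys that repeat after normalization (this includes the exact duplicates).
        "possible_hidden_duplicates": [k for k, n in loose.items() if n > 1],
        # Tally of raw values, to spot inconsistent spellings.
        "value_counts": Counter(r[check_field] for r in rows),
    }

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
report = take_temperature(rows, expected_count=3,
                          key_fields=["name", "city"], check_field="city")
for label, result in report.items():
    print(label, result)
```

A repeated key is only a lead. Whether two matching records are real duplicates still has to be confirmed against the documentation or the source.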
  5. Internal consistency checks
     • Is there more money going to sub-contractors than went to the prime contractor?
     • Are there more teachers than students?
     • How about other important fields?
     • Check the range of fields. (For example, check for DOBs that would make people too old or too young.)
     • Check for missing data or blank fields. Are they real values, or did something happen with an import or append query?
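A minimal Python sketch of three of these checks follows. The records, the field names, and the age limits (0 to 110) are assumptions made up for illustration. Real thresholds depend on the dataset.

```python
from datetime import date

# Hypothetical records; the field names are illustrative only.
contracts = [
    {"id": "A", "prime_amount": 1000.0, "sub_amount": 400.0},
    {"id": "B", "prime_amount": 500.0, "sub_amount": 750.0},
]
people = [
    {"name": "Ann", "dob": date(1980, 5, 1)},
    {"name": "Bob", "dob": date(1880, 1, 1)},
    {"name": "Cy", "dob": date(2030, 1, 1)},
    {"name": "Di", "dob": None},
]

def subs_exceed_prime(rows):
    # Flag contracts where sub-contractors received more than the prime contractor.
    return [r["id"] for r in rows if r["sub_amount"] > r["prime_amount"]]

def age_on(dob, as_of):
    # Age in whole years on the as_of date.
    return as_of.year - dob.year - ((as_of.month, as_of.day) < (dob.month, dob.day))

def dob_out_of_range(rows, as_of, min_age=0, max_age=110):
    # Flag people whose date of birth makes them too young or too old.
    flagged = []
    for r in rows:
        if r["dob"] is None:
            continue  # blanks are reported separately
        age = age_on(r["dob"], as_of)
        if age < min_age or age > max_age:
            flagged.append(r["name"])
    return flagged

def blanks(rows, field):
    # Flag records where the field is missing or empty.
    return [r["name"] for r in rows if r[field] in (None, "")]

as_of = date(2013, 6, 1)
print(subs_exceed_prime(contracts))
print(dob_out_of_range(people, as_of))
print(blanks(people, "dob"))
```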
  6. External checks
     • Compare to reports
     • Data reported to other agencies
     • On-the-ground reporting
     • Verification from sources
  7. Steps for cleaning data
     • Assess the problem
     • Identify your goal
     • Find the right tool for the job
     • Set aside time (double what you think)
     • Make a backup copy
     • Make a backup copy
     • Never alter the original data. Make new columns so you can compare and show your work.
     • Create an audit trail.
     • Spot check as you go.
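One way to follow the "new columns" and "audit trail" advice in code is sketched below. The `clean_city` rule, the `_clean` column suffix, and the audit-log layout are hypothetical choices for this example, not something the slides prescribe.

```python
import copy

def clean_city(value):
    # Illustrative rule only: collapse whitespace and use title case.
    return " ".join(value.split()).title()

def add_clean_column(rows, field, cleaner, audit_log):
    # Work on a deep copy so the original rows are never altered.
    cleaned_rows = copy.deepcopy(rows)
    new_field = field + "_clean"
    for i, row in enumerate(cleaned_rows):
        # Write the cleaned value to a new column, next to the original.
        row[new_field] = cleaner(row[field])
        if row[new_field] != row[field]:
            # Record every change so the work can be reviewed later.
            audit_log.append({"row": i, "field": field,
                              "old": row[field], "new": row[new_field]})
    return cleaned_rows

original = [{"city": "  st. louis"}, {"city": "Columbia"}, {"city": "KANSAS  CITY "}]
audit = []
cleaned = add_clean_column(original, "city", clean_city, audit)

for entry in audit:
    print(entry)
```

Keeping the old and new values side by side makes spot checks straightforward: any row in the audit log can be compared against the untouched original.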
  8. Tips for success
     • Keep a data notebook
     • Duplicate your work
     • Duplicate your work
     • Bounce your results off folks who really know the data
     • Set up some standards for your work/newsroom
  9. Choose the right tool
     • You don't need to be fancy, just get the job done
     • Work with what you're comfortable with
     • Don't forget the power of Excel
     • Text editors can be lifesavers
     • Many tools exist - OpenRefine, programming, etc.
     • Get training as needed
  10. Focus is important
  11. So get plenty of food and rest
  12. Get a data buddy
  13. Common ailments
  14. Dates that aren't dates
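The slide gives only the title, so the following is a guess at the kind of problem meant: date values stored as text that may or may not parse. This minimal sketch tries a short list of assumed formats and sets aside anything that fails. The `parse_date` helper and the format list are hypothetical.

```python
from datetime import datetime

# Assumed formats for this example; a real dataset may need others.
CANDIDATE_FORMATS = ["%m/%d/%Y", "%Y-%m-%d"]

def parse_date(text, formats=CANDIDATE_FORMATS):
    # Return a date if the text matches one of the formats, otherwise None.
    text = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

raw = ["06/20/2013", "2013-06-20", " 6/20/2013 ", "June 2013", "02/30/2013", ""]
parsed = [parse_date(v) for v in raw]

# Values that look like dates to a person but did not parse need manual review.
unparsed = [v for v, d in zip(raw, parsed) if d is None]
print(unparsed)
```

Impossible dates such as February 30 fail to parse and land in the review list alongside free-text values and blanks.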
  15. Names, names, names...
  16. Location matters
  17. Leading and trailing spaces
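Stray spaces are hard to see but make otherwise identical values count as different. A small sketch, assuming hypothetical rows and a `name` field, that finds padded values and writes stripped copies to a new column:

```python
def find_padded(rows, field):
    # Row indexes where the value has leading or trailing whitespace.
    return [i for i, r in enumerate(rows) if r[field] != r[field].strip()]

def with_stripped(rows, field):
    # New rows with a stripped copy in a "<field>_clean" column; originals untouched.
    return [{**r, field + "_clean": r[field].strip()} for r in rows]

rows = [{"name": "SMITH "}, {"name": " SMITH"}, {"name": "SMITH"}]

padded = find_padded(rows, "name")
cleaned = with_stripped(rows, "name")

# Counting distinct values before and after shows the effect of the padding.
distinct_before = len({r["name"] for r in rows})
distinct_after = len({r["name_clean"] for r in cleaned})

# repr() makes the otherwise invisible spaces visible.
print([repr(rows[i]["name"]) for i in padded])
print(distinct_before, distinct_after)
```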
  18. "Pretty" reports
  19. Inoperable data: Pain management
      • Explain caveats
      • Choose your wording carefully
      • Know when to leave out records
      • Be transparent
      • Know what questions can and can't be answered with this dataset
      • Know when to get more information
  20. Continue learning about dirty data: Sat. 3:40 p.m., Conference Room 11
      BYOD (Bring your own data): Sat. 4:50 p.m., Conference Room 11
      Get your hands dirty
  21. (@j_la28) (@jaimidowdell)
  22. Questions?