Diagnosing dirty data_ire2013


Diagnosing dirty data - IRE 2013 (including cat photos)

Published in: Technology, News & Politics
Transcript of "Diagnosing dirty data_ire2013"

  1. Diagnosing Dirty Data
     Jaimi Dowdell, IRE/NICAR
     Jennifer LaFleur, ProPublica
  2. Get your data's history
     • Know the source of the data
     • Know how it's used
     • Know what all the fields mean
     • Know what other stories have been done with it
  3. What is dirty data?
     • Missing records
     • Incorrect information
     • Duplicate information
     • No standardization
  4. Take your data's temperature
     • How many records should you have?
     • Double-check totals or counts. Check for studies/summary reports.
     • Check for duplicates. Make sure they are real duplicates. Is it possible that there are hidden duplicates?
     • Consistency-check all fields. Are all city/county names spelled the same? Are all codes found within documentation?
  5. Internal consistency checks
     • Is there more money going to sub-contractors than went to the prime contractor?
     • Are there more teachers than students?
     • How about other important fields?
     • Check the range of fields. (For example, check for DOBs that would make people too old or too young.)
     • Check for missing data or blank fields. Are they real values, or did something happen with an import or append query?
  6. External checks
     • Compare to reports
     • Data reported to other agencies
     • On-the-ground reporting
     • Verification from sources
  7. Steps for cleaning data
     • Assess the problem
     • Identify your goal
     • Find the right tool for the job
     • Set aside time (double what you think)
     • Make a backup copy
     • Make a backup copy
     • Never alter the original data. Make new columns so you can compare and show your work.
     • Create an audit trail.
     • Spot check as you go.
  8. Tips for success
     • Keep a data notebook
     • Duplicate your work
     • Duplicate your work
     • Bounce your results off folks who really know the data
     • Set up some standards for your work/newsroom
  9. Choose the right tool
     • You don't need to be fancy, just get the job done
     • Work with what you're comfortable with
     • Don't forget the power of Excel
     • Text editors can be lifesavers
     • Many tools exist - Open Refine, programming, etc.
     • Get training as needed
  10. Focus is important
  11. So get plenty of food and rest
  12. Get a data buddy
  13. Common ailments
  14. Dates that aren't dates
  15. Names, names, names...
  16. Location matters
  17. Leading and trailing spaces
  18. "Pretty" reports
  19. Inoperable data: Pain management
      • Explain caveats
      • Choose your wording carefully
      • Know when to leave out records
      • Be transparent
      • Know what questions can and can't be answered with this dataset
      • Know when to get more information
  20. Continue learning about dirty data: Sat. 3:40 p.m., Conference Room 11
      BYOD (Bring your own data): Sat. 4:50 p.m., Conference Room 11
      Get your hands dirty
  21. Jennifer.lafleur@propublica.org (@j_la28)
      jaimi@ire.org (@jaimidowdell)
  22. Questions?
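
Several of the checks in the slides (hidden duplicates, range checks on DOBs, blank fields, dates that aren't dates, leading and trailing spaces) can be sketched in a few lines of Python. This is only an illustration, not the presenters' method: the sample records, the column names, the two accepted date formats, the 110-year age cutoff and the fixed reference date are all assumptions made up for the example. Following the "never alter the original data" step, the cleaned values go into new columns next to the originals.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical sample data; names, columns and values are invented for illustration.
RAW = """name,city,dob
 Jane Doe ,St. Louis,1980-02-30
JANE DOE,st. louis,03/15/1975
John Smith,Columbia,1890-01-01
,Columbia,2001-07-04
"""

# Assumed date formats; check your data's documentation for the real ones.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize(text):
    """Trim leading/trailing spaces, collapse inner whitespace, uppercase."""
    return " ".join(text.split()).upper()


def parse_date(text):
    """Return a date if the text matches a known format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def diagnose(rows, today):
    """Add cleaned columns beside the originals and list rows to review."""
    problems = []
    seen = {}
    for row_no, row in enumerate(rows, start=1):
        # New columns only: the original values are never overwritten.
        row["name_clean"] = normalize(row["name"])
        row["city_clean"] = normalize(row["city"])
        row["dob_parsed"] = parse_date(row["dob"])

        if not row["name_clean"]:
            problems.append((row_no, "blank name"))

        if row["dob_parsed"] is None:
            problems.append((row_no, "date that isn't a date"))
        else:
            age = (today - row["dob_parsed"]).days // 365  # rough age in years
            if age < 0 or age > 110:  # assumed plausible range
                problems.append((row_no, "DOB out of range"))

        # Hidden duplicates: same name and city once case and spacing are ignored.
        key = (row["name_clean"], row["city_clean"])
        if row["name_clean"] and key in seen:
            problems.append((row_no, f"possible duplicate of row {seen[key]}"))
        else:
            seen.setdefault(key, row_no)
    return problems


rows = list(csv.DictReader(io.StringIO(RAW)))
# A fixed reference date keeps the age check reproducible; use date.today() in practice.
problems = diagnose(rows, today=date(2013, 6, 22))
for row_no, message in problems:
    print(row_no, message)
```

The list of problems is a starting point for spot checks, not a verdict: a flagged "possible duplicate" may be two real people, and a blank field may be a genuine value rather than an import error.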