Get your data's history
• Know the source of the data
• Know how it's used
• Know what all the fields mean
• Know what other stories have been done with it

What is dirty data?
• Missing records
• Incorrect information
• Duplicate information
• No standardization

Take your data's temperature
• How many records should you have?
• Double-check totals or counts. Check for studies/summary reports.
• Check for duplicates. Make sure they are real duplicates. Is it possible that there are hidden duplicates?
• Consistency-check all fields. Are all city/county names spelled the same? Are all codes found within documentation?

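The temperature checks above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the sample records, the field names (`name`, `city`) and the `normalize` helper are all hypothetical, and real data would normally be loaded from a file with something like `csv.DictReader`.

```python
from collections import Counter

# Hypothetical sample records; field names and values are made up.
records = [
    {"name": "Ann Lee", "city": "St. Louis"},
    {"name": "Ann Lee", "city": "St. Louis"},      # exact duplicate
    {"name": "ann lee ", "city": "Saint Louis"},   # possible hidden duplicate
    {"name": "Bob Ray", "city": "Columbia"},
]


def normalize(value):
    """Lowercase and collapse whitespace so near-identical values compare equal."""
    return " ".join(value.lower().split())


# How many records do you have? Compare this to the count you expected.
record_count = len(records)

# Exact duplicates: rows that are identical in every field.
exact_counts = Counter(tuple(sorted(r.items())) for r in records)
exact_duplicates = [dict(key) for key, n in exact_counts.items() if n > 1]

# Possible hidden duplicates: names that match only after normalization.
# These are candidates to review by hand, not confirmed duplicates.
name_counts = Counter(normalize(r["name"]) for r in records)
possible_hidden = [name for name, n in name_counts.items() if n > 1]

# Consistency check: list every spelling of a field with its frequency.
city_spellings = Counter(r["city"]for r in records)

print(record_count, exact_duplicates, possible_hidden, city_spellings)
```

A frequency list like `city_spellings` is often the quickest way to spot variant spellings; whether "St. Louis" and "Saint Louis" are the same place is still a judgment the reporter has to make.
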
Internal consistency checks
• Is there more money going to sub-contractors than went to the prime contractor?
• Are there more teachers than students?
• How about other important fields?
• Check the range of fields. (For example, check for DOBs that would make people too old or too young.)
• Check for missing data or blank fields. Are they real values, or did something happen with an import or append query?

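As a rough sketch of the internal checks above, the following Python flags blank dates of birth, implausible ages, and contracts where the sub-contractor amount exceeds the prime contractor amount. The rows, the field names (`dob`, `prime_amount`, `sub_amount`), the reference date and the 110-year age ceiling are assumptions chosen for illustration; pick thresholds that fit your own dataset.

```python
from datetime import date

# Hypothetical rows; every field arrives as text, as it would from a CSV file.
rows = [
    {"id": "1", "dob": "1985-04-12", "prime_amount": "1000", "sub_amount": "400"},
    {"id": "2", "dob": "1885-04-12", "prime_amount": "500", "sub_amount": "900"},
    {"id": "3", "dob": "", "prime_amount": "200", "sub_amount": "50"},
]

AS_OF = date(2024, 1, 1)  # fixed reference date (assumption) so results are repeatable
MAX_AGE = 110             # assumed upper bound for a plausible age


def age_on(dob, as_of):
    """Return age in whole years on the given date."""
    years = as_of.year - dob.year
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


blank_dob, implausible_age, sub_exceeds_prime = [], [], []
for row in rows:
    if not row["dob"].strip():
        # Blank field: a real missing value, or a failed import?
        blank_dob.append(row["id"])
    else:
        age = age_on(date.fromisoformat(row["dob"]), AS_OF)
        if age < 0 or age > MAX_AGE:
            implausible_age.append(row["id"])
    if float(row["sub_amount"]) > float(row["prime_amount"]):
        sub_exceeds_prime.append(row["id"])

print(blank_dob, implausible_age, sub_exceeds_prime)
```

The flagged IDs are leads for follow-up, not proof of an error: an unusual value can turn out to be correct once checked against the source.
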
External checks
• Compare to reports
• Data reported to other agencies
• On-the-ground reporting
• Verification from sources

Steps for cleaning data
• Assess the problem
• Identify your goal
• Find the right tool for the job
• Set aside time (double what you think)
• Make a backup copy
• Make a backup copy
• Never alter the original data. Make new columns so you can compare and show your work.
• Create an audit trail.
• Spot check as you go.

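One way to follow the "never alter the original" and "create an audit trail" steps is sketched below in Python. The `county` field, the `county_clean` column name and the `COUNTY_FIXES` lookup are hypothetical; the point is only that cleaned values go into a new column on a copy, and every change is logged.

```python
import copy

# Hypothetical original data; this list is never modified.
original = [{"county": " boone "}, {"county": "BOONE CO."}, {"county": "Cole"}]

# Work on a copy so the original stays available for comparison.
working = copy.deepcopy(original)
audit_log = []

# Hypothetical hand-built lookup of known variants and their standard form.
COUNTY_FIXES = {"boone co.": "Boone"}

for i, row in enumerate(working):
    raw = row["county"]
    key = " ".join(raw.lower().split())            # trim and lowercase for matching
    cleaned = COUNTY_FIXES.get(key, key.title())   # fall back to title case
    row["county_clean"] = cleaned                  # new column; raw value is kept
    if cleaned != raw:
        # Audit trail: record what changed and where.
        audit_log.append({"row": i, "field": "county", "old": raw, "new": cleaned})

for entry in audit_log:
    print(entry)
```

Keeping `county` and `county_clean` side by side makes spot checks easy, and the log can be saved with your data notebook.
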
Tips for success
• Keep a data notebook
• Duplicate your work
• Duplicate your work
• Bounce your results off folks who really know the data
• Set up some standards for your work/newsroom

Choose the right tool
• You don't need to be fancy, just get the job done
• Work with what you're comfortable with
• Don't forget the power of Excel
• Text editors can be lifesavers
• Many tools exist: OpenRefine, programming, etc.
• Get training as needed

Inoperable data: Pain management
• Explain caveats
• Choose your wording carefully
• Know when to leave out records
• Be transparent
• Know what questions can and can't be answered with this dataset
• Know when to get more information

Continue learning about dirty data: Sat. 3:40 p.m., Conference Room 11
BYOD (Bring your own data): Sat. 4:50 p.m., Conference Room 11
Get your hands dirty